
Nahla Ben Amor

Benjamin Quost
Martin Theobald (Eds.)
LNAI 11940

Scalable Uncertainty
Management
13th International Conference, SUM 2019
Compiègne, France, December 16–18, 2019
Proceedings

Lecture Notes in Artificial Intelligence 11940

Subseries of Lecture Notes in Computer Science

Series Editors
Randy Goebel
University of Alberta, Edmonton, Canada
Yuzuru Tanaka
Hokkaido University, Sapporo, Japan
Wolfgang Wahlster
DFKI and Saarland University, Saarbrücken, Germany

Founding Editor
Jörg Siekmann
DFKI and Saarland University, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/1244
Nahla Ben Amor • Benjamin Quost • Martin Theobald (Eds.)

Scalable Uncertainty
Management
13th International Conference, SUM 2019
Compiègne, France, December 16–18, 2019
Proceedings

Editors
Nahla Ben Amor
Institut Supérieur de Gestion de Tunis, Bouchoucha, Tunisia

Benjamin Quost
University of Technology of Compiègne, Compiègne, France
Martin Theobald
University of Luxembourg
Esch-Sur-Alzette, Luxembourg

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Artificial Intelligence
ISBN 978-3-030-35513-5 ISBN 978-3-030-35514-2 (eBook)
https://doi.org/10.1007/978-3-030-35514-2
LNCS Sublibrary: SL7 – Artificial Intelligence

© Springer Nature Switzerland AG 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

These are the proceedings of the 13th International Conference on Scalable Uncertainty
Management (SUM 2019) held during December 16–18, 2019, in Compiègne, France.
The SUM conferences are annual events which gather researchers interested in the
management of imperfect information from a wide range of fields, such as artificial
intelligence, databases, information retrieval, machine learning, and risk analysis, and
with the aim of fostering the collaboration and cross-fertilization of ideas from different
communities.
The first SUM conference was held in Washington DC in 2007. Since then, the
SUM conferences have successively taken place in Napoli in 2008, Washington DC in
2009, Toulouse in 2010, Dayton in 2011, Marburg in 2012, Washington DC in 2013,
Oxford in 2014, Québec in 2015, Nice in 2016, Granada in 2017, and Milano in 2018.
The 25 full papers, 4 short papers, 4 tutorial papers, and 2 invited keynote papers
gathered in this volume were selected from a total of 44 submissions (5 of which were
desk-rejected or withdrawn by the authors) after a rigorous peer-review process
involving at least 3 Program Committee members per submission. In addition to the
regular presentations, the technical
program of SUM 2019 also included invited lectures by three outstanding researchers:
Cassio P. de Campos (Eindhoven University of Technology, The Netherlands) on
“Scalable Reliable Machine Learning Using Sum-Product Networks,” Jérôme Lang
(CNRS, Paris, France) on “Computational Social Choice,” and Wolfgang Gatterbauer
(Northeastern University, Boston, USA) on “Algebraic Approximations of the
Probability of Boolean Functions.”
A distinctive feature of the SUM conferences is the large space their programs
dedicate to invited tutorials on a wide range of topics related to uncertainty
management, further embracing the aim of facilitating interdisciplinary collaboration
and the cross-fertilization of ideas. This edition included five tutorials, and we
thank Christophe Gonzales, Thierry Denœux, Marie-Jeanne Lesot, Maximilian Schleich,
and the Kay R. Amel working group for preparing and presenting them (four of these
tutorials have a companion paper included in this volume).
We would like to thank all of the authors, invited speakers, and tutorial speakers for
their valuable contributions. We in particular also express our gratitude to the members
of the Program Committee as well as to the external reviewers for their constructive
comments on the submissions. We also extend our appreciation to all participants
of SUM 2019 for their contribution to the success of the conference. We are grateful
to the Steering Committee for their suggestions and support, and to the
Organization Committee for the great work
accomplished. We are also very grateful to the Université de Technologie de
Compiègne (UTC) for hosting the conference, to the Heudiasyc laboratory and the
MS2T laboratory of excellence for their financial and technical support, and to Springer
for sponsoring the Best Paper Award as well as for the ongoing support of its staff in
publishing this volume.

December 2019

Nahla Ben Amor
Benjamin Quost
Martin Theobald
Organization

General Chair
Benjamin Quost Université de Technologie de Compiègne, France

Program Committee Chairs


Nahla Ben Amor LARODEC - Institut Supérieur de Gestion Tunis,
Tunisia
Martin Theobald University of Luxembourg, Luxembourg

Steering Committee
Didier Dubois IRIT-CNRS, France
Lluis Godo IIIA-CSIC, Spain
Eyke Hüllermeier Universität Paderborn, Germany
Anthony Hunter University College London, UK
Henri Prade IRIT-CNRS, France
Steven Schockaert Cardiff University, UK
V. S. Subrahmanian University of Maryland, USA

Program Committee
Nahla Ben Amor (PC Chair) Institut Supérieur de Gestion de Tunis and LARODEC,
Tunisia
Martin Theobald (PC Chair) University of Luxembourg, Luxembourg
Sébastien Destercke CNRS, Heudiasyc, France
Henri Prade CNRS-IRIT, France
John Grant Towson University, USA
Leila Amgoud CNRS-IRIT, France
Benjamin Quost Université de Technologie de Compiègne, Heudiasyc,
France
Thomas Lukasiewicz University of Oxford, UK
Pierre Senellart DI, École Normale Supérieure, Université PSL, France
Francesco Parisi DIMES, University of Calabria, Italy
Davide Ciucci Università di Milano-Bicocca, Italy
Fernando Bobillo University of Zaragoza, Spain
Salem Benferhat UMR CNRS 8188, Université d’Artois, France
Silviu Maniu Université Paris-Sud, France
Rafael Peñaloza University of Milano-Bicocca, Italy


Fabio Cozman University of São Paulo, Brazil
Umberto Straccia ISTI-CNR, Italy
Lluis Godo Artificial Intelligence Research Institute, IIIA-CSIC,
Spain
Philippe Leray LINA/DUKe, Université de Nantes, France
Zied Elouedi Institut Supérieur de Gestion de Tunis, Tunisia
Olivier Pivert IRISA Laboratory, ENSSAT, France
Didier Dubois CNRS-IRIT, France
Olivier Colot Université Lille I, France
Leopoldo Bertossi Relational AI Inc. and Carleton University, Canada
Manuel Gómez-Olmedo University of Granada, Spain
Andrea Pugliese University of Calabria, Italy
Alessandro Antonucci IDSIA, Switzerland
Maurice van Keulen University of Twente, The Netherlands
Thierry Denœux Université de Technologie de Compiègne, France
Sebastian Link University of Auckland, New Zealand
Christoph Beierle FernUniversität Hagen, Germany
Cassio De Campos Utrecht University, The Netherlands
Andrea Tettamanzi Université de Nice-Sophia-Antipolis, France
Rainer Gemulla Universität Mannheim, Germany
Daniel Deutch Tel Aviv University, Israel
Raouia Ayachi LARODEC, Institut Supérieur de Gestion de Tunis,
Tunisia
Imen Boukhris LARODEC, Institut Supérieur de Gestion de Tunis,
Tunisia

Organization Committee
Yonatan Carlos Carranza Alarcon Université de Technologie de Compiègne, France
Sébastien Destercke CNRS, Université de Technologie de Compiègne,
France
Marie-Hélène Masson Université de Picardie Jules Verne, France
Benjamin Quost (General Chair) Université de Technologie de Compiègne, France
David Savourey Université de Technologie de Compiègne, France
Contents

An Experimental Study on the Behaviour of Inconsistency Measures. . . . . . . 1


Matthias Thimm

Inconsistency Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Matthias Thimm

Using Graph Convolutional Networks for Approximate Reasoning


with Abstract Argumentation Frameworks: A Feasibility Study . . . . . . . . . . . 24
Isabelle Kuhlmann and Matthias Thimm

The Hidden Elegance of Causal Interaction Models . . . . . . . . . . . . . . . . . . . 38


Silja Renooij and Linda C. van der Gaag

Computational Models for Cumulative Prospect Theory: Application


to the Knapsack Problem Under Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Hugo Martin and Patrice Perny

On a New Evidential C-Means Algorithm with Instance-Level Constraints . . . 66


Jiarui Xie and Violaine Antoine

Hybrid Reasoning on a Bipolar Argumentation Framework . . . . . . . . . . . . . 79


Tatsuki Kawasaki, Sosuke Moriguchi, and Kazuko Takahashi

Active Preference Elicitation by Bayesian Updating


on Optimality Polyhedra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Nadjet Bourdache, Patrice Perny, and Olivier Spanjaard

Selecting Relevant Association Rules From Imperfect Data . . . . . . . . . . . . . 107


Cécile L’Héritier, Sébastien Harispe, Abdelhak Imoussaten,
Gilles Dusserre, and Benoît Roig

Evidential Classification of Incomplete Data via Imprecise Relabelling:


Application to Plastic Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Lucie Jacquin, Abdelhak Imoussaten, François Trousset,
Jacky Montmain, and Didier Perrin

An Analogical Interpolation Method for Enlarging a Training Dataset . . . . . . 136


Myriam Bounhas and Henri Prade

Towards a Reconciliation Between Reasoning and Learning -


A Position Paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Didier Dubois and Henri Prade

CP-Nets, p-pref Nets, and Pareto Dominance . . . . . . . . . . . . . . . . . . . . . . . 169


Nic Wilson, Didier Dubois, and Henri Prade

Measuring Inconsistency Through Subformula Forgetting. . . . . . . . . . . . . . . 184


Yakoub Salhi

Explaining Hierarchical Multi-linear Models . . . . . . . . . . . . . . . . . . . . . . . . 192


Christophe Labreuche

Assertional Removed Sets Merging of DL-Lite Knowledge Bases . . . . . . . . . 207


Salem Benferhat, Zied Bouraoui, Odile Papini, and Eric Würbel

An Interactive Polyhedral Approach for Multi-objective Combinatorial


Optimization with Incomplete Preference Information . . . . . . . . . . . . . . . . . 221
Nawal Benabbou and Thibaut Lust

Open-Mindedness of Gradual Argumentation Semantics. . . . . . . . . . . . . . . . 236


Nico Potyka

Approximate Querying on Property Graphs . . . . . . . . . . . . . . . . . . . . . . . . 250


Stefania Dumbrava, Angela Bonifati, Amaia Nazabal Ruiz Diaz,
and Romain Vuillemot

Learning from Imprecise Data: Adjustments of Optimistic


and Pessimistic Variants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Eyke Hüllermeier, Sébastien Destercke, and Ines Couso

On Cautiousness and Expressiveness in Interval-Valued Logic . . . . . . . . . . . 280


Sébastien Destercke and Sylvain Lagrue

Preference Elicitation with Uncertainty: Extending Regret Based


Methods with Belief Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Pierre-Louis Guillot and Sebastien Destercke

Evidence Propagation and Consensus Formation in Noisy Environments . . . . 310


Michael Crosscombe, Jonathan Lawry, and Palina Bartashevich

Order-Independent Structure Learning of Multivariate Regression


Chain Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Mohammad Ali Javidian, Marco Valtorta, and Pooyan Jamshidi

Comparison of Analogy-Based Methods for Predicting Preferences . . . . . . . . 339


Myriam Bounhas, Marc Pirlot, Henri Prade, and Olivier Sobrie

Using Convolutional Neural Network in Cross-Domain Argumentation


Mining Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Rihab Bouslama, Raouia Ayachi, and Nahla Ben Amor

ConvNet and Dempster-Shafer Theory for Object Recognition . . . . . . . . . . . 368


Zheng Tong, Philippe Xu, and Thierry Denœux

On Learning Evidential Contextual Corrections from Soft Labels Using


a Measure of Discrepancy Between Contour Functions . . . . . . . . . . . . . . . . 382
Siti Mutmainah, Samir Hachour, Frédéric Pichon, and David Mercier

Efficient Möbius Transformations and Their Applications to D-S Theory . . . . 390


Maxime Chaveroche, Franck Davoine, and Véronique Cherfaoui

Dealing with Continuous Variables in Graphical Models . . . . . . . . . . . . . . . 404


Christophe Gonzales

Towards Scalable and Robust Sum-Product Networks . . . . . . . . . . . . . . . . . 409


Alvaro H. C. Correia and Cassio P. de Campos

Learning Models over Relational Data: A Brief Tutorial . . . . . . . . . . . . . . . 423


Maximilian Schleich, Dan Olteanu, Mahmoud Abo-Khamis,
Hung Q. Ngo, and XuanLong Nguyen

Subspace Clustering and Some Soft Variants . . . . . . . . . . . . . . . . . . . . . . . 433


Marie-Jeanne Lesot

Invited Keynotes

From Shallow to Deep Interactions Between Knowledge Representation,


Reasoning and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Kay R. Amel

Algebraic Approximations for Weighted Model Counting . . . . . . . . . . . . . . 449


Wolfgang Gatterbauer

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451


An Experimental Study on the Behaviour
of Inconsistency Measures

Matthias Thimm(B)

University of Koblenz-Landau, Koblenz, Germany


[email protected]

Abstract. We apply a selection of 19 inconsistency measures from the
literature on artificially generated knowledge bases and study the
distribution of their values and their pairwise correlation. This study
augments previous analytical evaluations on the expressivity and the
pairwise incompatibility of these measures and our findings show that
(1) many measures assign only few distinct values to many different
knowledge bases, and (2) many measures, although founded on different
theoretical concepts, correlate significantly.

1 Introduction
An inconsistency measure I is a function that assigns to a knowledge base K
(usually assumed to be formalised in propositional logic) a non-negative real
value I(K) such that I(K) = 0 iff K is consistent and larger values of I(K)
indicate “larger” inconsistency in K [3,5,12]. Thus, each inconsistency measure
I formalises a notion of a degree of inconsistency and a lot of different concrete
approaches have been proposed so far, see [11–13] for some surveys. The quest
for the “right” way to measure inconsistency is still ongoing and many (usually
controversial) rationality postulates to describe the desirable behaviour of an
inconsistency measure have been proposed so far [2,12].
Our study aims at providing a new perspective on the analysis of exist-
ing approaches to inconsistency measurement by experimentally analysing the
behaviour of inconsistency measures. More precisely, our study provides a quan-
titative analysis of two aspects of inconsistency measures:
A1 the distribution of inconsistency values on actual knowledge bases, and
A2 the correlation of different inconsistency measures.
Regarding the first item, [11] investigated the theoretical expressivity of incon-
sistency measures, i. e., the number of different inconsistency values a measure
attains when some dimension of the knowledge base is bounded (such as the
number of formulas or the size of the signature). One result in [11] is that e. g.
the measure I_dalal^Σ (see Sect. 3) has maximal expressivity and the number of
different inconsistency values is not bounded if only one of these two dimensions
is bounded. However, [11] does not investigate the distribution of inconsistency
values. It may be the case that, although a measure can attain many different
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 1–8, 2019.
https://doi.org/10.1007/978-3-030-35514-2_1
values, most inconsistent knowledge bases are clustered on very few inconsistency
values. Regarding the second item, previous works have shown—see [12] for an
overview—that all inconsistency measures developed so far are “essentially” dif-
ferent. More precisely, for each pair of measures one can find a property that is
satisfied by one measure but not by the other. Moreover, for each pair of
inconsistency measures one can find knowledge bases that are ordered differently wrt.
their inconsistency. However, until now it has not been investigated how “sig-
nificant” the difference between measures actually is. It may be the case that
two measures order all but just a very few knowledge bases differently (or the
other way around). In order to analyse these two aspects we applied 19 different
inconsistency measures from the literature on artificially generated knowledge
bases and performed a statistical analysis on the results. After a brief review of
necessary preliminaries in Sect. 2 and the considered inconsistency measures in
Sect. 3, we provide some details on our experiments and our findings in Sect. 4
and conclude in Sect. 5.

2 Preliminaries
Let At be some fixed propositional signature, i. e., a (possibly infinite) set of
propositions, and let L(At) be the corresponding propositional language con-
structed using the usual connectives ∧ (and ), ∨ (or ), and ¬ (negation).
Definition 1. A knowledge base K is a finite set of formulas K ⊆ L(At). Let K
be the set of all knowledge bases.
If X is a formula or a set of formulas we write At(X) to denote the set of
propositions appearing in X. Semantics to a propositional language is given by
interpretations and an interpretation ω on At is a function ω : At → {true, false}.
Let Ω(At) denote the set of all interpretations for At. An interpretation ω satisfies
(or is a model of) an atom a ∈ At, denoted by ω |= a, if and only if ω(a) = true.
The satisfaction relation |= is extended to formulas in the usual way.
For Φ ⊆ L(At) we also define ω |= Φ if and only if ω |= φ for every φ ∈ Φ.
Define furthermore the set of models Mod(X) = {ω ∈ Ω(At) | ω |= X} for every
formula or set of formulas X. By abusing notation, a formula or set of formulas
X1 entails another formula or set of formulas X2 , denoted by X1 |= X2 , if
Mod(X1 ) ⊆ Mod(X2 ). Two formulas or sets of formulas X1 , X2 are equivalent,
denoted by X1 ≡ X2 , if Mod(X1 ) = Mod(X2 ). If Mod(X) = ∅ we also write
X |=⊥ and say that X is inconsistent.

3 Inconsistency Measures

Let R^∞_≥0 be the set of non-negative real values including ∞. Inconsistency
measures are functions I : K → R^∞_≥0 that aim at assessing the severity of the
inconsistency in a knowledge base K. The basic idea is that the larger the incon-
sistency in K the larger the value I(K). We refer to [11–13] for surveys.

Fig. 1. Definitions of the considered measures

The formal definitions of the considered inconsistency measures can be found
in Fig. 1, while the necessary notation for understanding these measures follows
below. Please see the above-mentioned surveys and the original papers referenced
therein for explanations and examples.
A set M ⊆ K is called a minimal inconsistent subset (MI) of K if M ⊨ ⊥
and there is no M′ ⊊ M with M′ ⊨ ⊥. Let MI(K) be the set of all MIs of
K. Let furthermore MC(K) be the set of maximal consistent subsets of K, i. e.,
MC(K) = {K′ ⊆ K | K′ ⊭ ⊥ ∧ ∀K′′ with K′ ⊊ K′′ ⊆ K : K′′ ⊨ ⊥}, and let SC(K) be
the set of self-contradictory formulas of K, i. e., SC(K) = {φ ∈ K | φ ⊨ ⊥}.
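These set-theoretic notions can be made concrete with a small brute-force sketch. This is illustrative code, not the authors' implementation: formulas are represented as (name, Boolean function) pairs, an encoding assumed only for this example, and consistency is checked by enumerating all interpretations, which is only feasible for tiny signatures.

```python
from itertools import combinations, product

def consistent(formulas, atoms):
    # A set of formulas is consistent iff some interpretation satisfies all of them.
    worlds = (dict(zip(atoms, vals))
              for vals in product([True, False], repeat=len(atoms)))
    return any(all(fn(w) for _, fn in formulas) for w in worlds)

def minimal_inconsistent_subsets(kb, atoms):
    # MI(K): inconsistent subsets all of whose proper subsets are consistent.
    # Enumerating by increasing size, an inconsistent subset is minimal
    # iff it contains no previously found (smaller) MI.
    mis = []
    for r in range(1, len(kb) + 1):
        for sub in combinations(kb, r):
            if not consistent(sub, atoms) and not any(set(m) <= set(sub) for m in mis):
                mis.append(sub)
    return mis

def maximal_consistent_subsets(kb, atoms):
    # MC(K): consistent subsets not properly contained in a consistent subset.
    mcs = []
    for r in range(len(kb), -1, -1):
        for sub in combinations(kb, r):
            if consistent(sub, atoms) and not any(set(sub) <= set(m) for m in mcs):
                mcs.append(sub)
    return mcs

# K = {a, ¬a, b}: MI(K) = {{a, ¬a}}, MC(K) = {{a, b}, {¬a, b}}.
atoms = ["a", "b"]
kb = [("a", lambda w: w["a"]),
      ("¬a", lambda w: not w["a"]),
      ("b", lambda w: w["b"])]
```

For |K| formulas this enumerates up to 2^|K| subsets and 2^|At| interpretations per consistency check; several of the measures in Fig. 1 are then computed directly from MI(K) and MC(K).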
A probability function P is of the form P : Ω(At) → [0, 1] with
Σ_{ω∈Ω(At)} P(ω) = 1. Let P(At) be the set of all those probability functions, and
for a given probability function P ∈ P(At) define the probability of an arbitrary
formula φ via P(φ) = Σ_{ω⊨φ} P(ω).

A three-valued interpretation υ on At is a function υ : At → {T, F, B} where
the values T and F correspond to the classical true and false, respectively. The
additional truth value B stands for both and is meant to represent a conflicting
truth value for a proposition. Taking into account the truth order ≺ defined
via F ≺ B ≺ T, an interpretation υ is extended to arbitrary formulas via
υ(φ1 ∧ φ2) = min≺(υ(φ1), υ(φ2)), υ(φ1 ∨ φ2) = max≺(υ(φ1), υ(φ2)), and υ(¬T) =
F, υ(¬F) = T, υ(¬B) = B. An interpretation υ satisfies a formula α, denoted
by υ |=3 α, if either υ(α) = T or υ(α) = B.
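This three-valued semantics can be sketched in a few lines; the nested-tuple formula encoding below ("atom"/"not"/"and"/"or") is an assumption of this example, not notation from the paper.

```python
ORDER = {"F": 0, "B": 1, "T": 2}        # truth order F ≺ B ≺ T
NEG = {"T": "F", "F": "T", "B": "B"}    # negation swaps T and F, fixes B

def ev(phi, v):
    # Evaluate a formula given as nested tuples, e.g. ("and", ("atom", "a"), ...),
    # under a three-valued interpretation v mapping atoms to "T", "F", or "B".
    op = phi[0]
    if op == "atom":
        return v[phi[1]]
    if op == "not":
        return NEG[ev(phi[1], v)]
    left, right = ev(phi[1], v), ev(phi[2], v)
    pick = min if op == "and" else max  # ∧ = min≺, ∨ = max≺
    return pick(left, right, key=ORDER.get)

def sat3(phi, v):
    # v |=3 phi iff phi evaluates to T or B.
    return ev(phi, v) in ("T", "B")
```

Under this semantics a classical contradiction such as a ∧ ¬a is still satisfied by any interpretation assigning B to a, which is what measures built on three-valued models exploit.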
The Dalal distance d_d is a distance function for interpretations in Ω(At) and
is defined as d_d(ω, ω′) = |{a ∈ At | ω(a) ≠ ω′(a)}| for all ω, ω′ ∈ Ω(At). If
X ⊆ Ω(At) is a set of interpretations, we define d_d(X, ω) = min_{ω′∈X} d_d(ω′, ω)
(if X = ∅ we define d_d(X, ω) = ∞). We consider the inconsistency measures
I_dalal^Σ, I_dalal^max, and I_dalal^hit from [4], but only for the Dalal distance.
Note that in [4] these measures were considered for arbitrary distances and that
we use a slightly different but equivalent definition of these measures.
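A minimal sketch of the Dalal distance, its lift to sets of interpretations, and a brute-force reading of the Σ-variant in the spirit of [4]. Interpretations are dicts and formulas are Boolean functions; both conventions are assumptions of this example, and the brute force is exponential in |At|.

```python
from itertools import product

def dalal(w1, w2):
    # d_d(w1, w2): number of atoms on which the two interpretations differ.
    return sum(w1[a] != w2[a] for a in w1)

def dalal_to_set(X, w):
    # d_d(X, w) = min over w' in X of d_d(w', w); infinity if X is empty.
    return min((dalal(wp, w) for wp in X), default=float("inf"))

def i_dalal_sum(kb, atoms):
    # Sketch of the Σ-variant: minimise, over all interpretations w,
    # the summed distance from w to the set of models of each formula.
    worlds = [dict(zip(atoms, vals))
              for vals in product([True, False], repeat=len(atoms))]
    return min(sum(dalal_to_set([m for m in worlds if phi(m)], w) for phi in kb)
               for w in worlds)
```

For a consistent knowledge base the minimum is 0 (a common model has distance 0 to each formula's models); for {a, ¬a} every interpretation is at summed distance 1.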
For every knowledge base K and i = 1, . . . , |K| define MI^(i)(K) = {M ∈
MI(K) | |M| = i} and CN^(i)(K) = {C ⊆ K | |C| = i ∧ C ⊭ ⊥}. Furthermore
define R_i(K) = 0 if |MI^(i)(K)| + |CN^(i)(K)| = 0 and otherwise
R_i(K) = |MI^(i)(K)|/(|MI^(i)(K)| + |CN^(i)(K)|). Note that the definition of I_Df
in Fig. 1 is only one instance of the family studied in [9]; other variants can be
obtained by different ways of aggregating the values R_i(K).
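The ratios R_i(K) are straightforward to compute once the sizes of the minimal inconsistent subsets and of the consistent subsets are known; the list-of-sizes interface below is an assumption of this example.

```python
def r_values(mi_sizes, cn_sizes, kb_size):
    # R_i(K) for i = 1..|K|, with R_i = |MI^(i)| / (|MI^(i)| + |CN^(i)|)
    # and R_i = 0 when there is no subset of size i in either family.
    rs = []
    for i in range(1, kb_size + 1):
        mi = mi_sizes.count(i)   # |MI^(i)(K)|: minimal inconsistent subsets of size i
        cn = cn_sizes.count(i)   # |CN^(i)(K)|: consistent subsets of size i
        rs.append(0 if mi + cn == 0 else mi / (mi + cn))
    return rs

# For K = {a, ¬a, b}: one MI of size 2; consistent subsets
# {a}, {¬a}, {b}, {a, b}, {¬a, b} of sizes 1, 1, 1, 2, 2.
```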
A set of maximal consistent subsets C ⊆ MC(K) is called an MC-cover [1] if
⋃_{C∈C} C = K. An MC-cover C is normal if no proper subset of C is an MC-cover.
A normal MC-cover is maximal if λ(C) = |⋂_{C∈C} C| is maximal among all normal
MC-covers.
For a formula φ let φ[a1, i1 → ψ1; . . . ; ak, ik → ψk] denote the formula φ
where the i_j-th occurrence of the proposition a_j is replaced by the formula ψ_j,
for all j = 1, . . . , k.
A set {K1 , . . . , Kn } of pairwise disjoint subsets of K is called a conditional
independent MUS (CI) partition of K [6], iff each Ki is inconsistent and MI(K1 ∪
. . . ∪ Kn ) is the disjoint union of all MI(Ki ).
An ordered set P = {P1, . . . , Pn} with Pi ⊆ MI(K) for i = 1, . . . , n is called
an ordered CSP-partition [7] of MI(K) if (1) MI(K) is the disjoint union of all
Pi for i = 1, . . . , n, (2) each Pi is a conditional independent MUS partition of
K for i = 1, . . . , n, and (3) |Pi| ≥ |Pi+1| for i = 1, . . . , n − 1. For such P
define furthermore W(P) = Σ_{i=1}^{n} |Pi|/i.

4 Experiments
In the following, we give some details on our experiments, the evaluation method-
ology, and our findings.

4.1 Knowledge Base Generation


Due to the lack of a dataset of real-world knowledge bases with a significantly rich
profile of inconsistencies, we used artificially generated knowledge bases. In order
to avoid biasing our study toward random instances of a specific probabilistic model
for knowledge base generation, we developed an algorithm that enumerates all
syntactically different knowledge bases with increasing size and considered the
first 188900 bases generated this way. For example, the first five knowledge bases
generated this way are ∅, {x1}, {¬x1}, {¬¬x1}, {x1, x2} and, e. g., number 72793
is {x1, x2, ¬x2, ¬(¬x2 ∧ ¬¬x2)}. Of the 188900 generated knowledge bases,
127814 are consistent and 61086 are inconsistent. For the remainder of this paper,
let K̂ denote the set of all 188900 knowledge bases and let K̂⊥ ⊆ K̂ be only the
inconsistent ones.
The implementation¹ of this algorithm is available in the Tweety project² [10].
The generated knowledge bases and their inconsistency values wrt. each of the
considered inconsistency measures are available online³.

4.2 Evaluation Measures


In order to evaluate A1, we apply the entropy on the distribution of inconsistency
values of each measure. For K ⊆ K let I(K) = {I(K) | K ∈ K} denote the image
of K wrt. I.

Definition 2. Let K be a set of knowledge bases and I be an inconsistency
measure. The entropy H_K(I) of I wrt. K is defined via

    H_K(I) = − Σ_{x∈I(K)} (|I^{−1}(x)|/|K|) ln (|I^{−1}(x)|/|K|)

where ln x denotes the natural logarithm with 0 ln 0 = 0.


For example, if a measure I* assigns to a set K* of 10 knowledge bases 5 times
the value X, 3 times the value Y, and 2 times the value Z, we have

    H_{K*}(I*) = −(5/10) ln(5/10) − (3/10) ln(3/10) − (2/10) ln(2/10) ≈ 1.03

The interpretation behind the entropy here is that a larger value H_K(I) indicates
a more uniform distribution of the inconsistency values on elements of K, while a
value H_K(I) = 0 indicates that all elements are assigned the same inconsistency
value. Thus, the larger H_K(I), the “more use” the measure makes of its available
inconsistency values.
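Definition 2 translates directly into a few lines of code; a sketch assuming the measure's values are given as a plain list, one entry per knowledge base:

```python
import math
from collections import Counter

def entropy_of_measure(values):
    # H_K(I): entropy of the distribution of inconsistency values.
    # Counter(values)[x] plays the role of |I^{-1}(x)|.
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())
```

Reproducing the worked example: five values X, three values Y, and two values Z give ≈ 1.03, while a measure assigning the same value everywhere gives 0.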
In order to evaluate A2, we use a specific notion of a correlation coefficient.
For two measures I1 and I2 and two knowledge bases K1 and K2 we say that I1
and I2 are order-compatible wrt. K1 and K2, denoted by I1 ∼_{K1,K2} I2, iff

    I1(K1) > I1(K2) ∧ I2(K1) > I2(K2),
    or I1(K1) < I1(K2) ∧ I2(K1) < I2(K2),
    or I1(K1) = I1(K2) ∧ I2(K1) = I2(K2).
¹ http://mthimm.de/r/?r=tweety-ckb.
² http://tweetyproject.org.
³ http://mthimm.de/misc/exim mt.zip.

Table 1. Entropy values of the investigated measures wrt. K̂⊥ (rounded to two
decimals and sorted by increasing entropy).

    Id: 0        ICC: 0.08    I_dalal^hit: 0.09    Ic: 0.12      Imc: 0.13
    Iforget: 0.18    IMI: 0.24    Iis: 0.28    I_dalal^max: 0.29    ICSP: 0.29
    Ihs: 0.29    Iη: 0.33    I_dalal^Σ: 0.36    IMIC: 0.37    Imv: 0.45
    Imcsc: 0.48    Ip: 0.51    Inc: 0.52    IDf: 0.78

Table 2. Correlation coefficients C_{K̂⊥}(·, ·) of the investigated measures wrt. K̂⊥
(rounded to two decimals). Columns follow the same order as the rows; only the
upper triangle is shown (each row starts at its diagonal entry), as the matrix is
symmetric.

Order: Id, IMI, IMIC, Iη, Ic, Imc, Ip, Ihs, I_dalal^Σ, I_dalal^max, I_dalal^hit,
IDf, Imv, Inc, Imcsc, ICSP, Iforget, ICC, Iis.

    Id:          1 0.69 0.44 0.5 0.86 0.87 0.35 0.52 0.47 0.52 0.9 0.22 0.48 0.33 0.37 0.68 0.76 0.92 0.67
    IMI:         1 0.54 0.37 0.72 0.74 0.65 0.38 0.41 0.38 0.76 0.28 0.41 0.47 0.52 0.99 0.7 0.75 0.99
    IMIC:        1 0.72 0.47 0.51 0.53 0.7 0.73 0.7 0.52 0.49 0.41 0.43 0.84 0.55 0.51 0.5 0.55
    Iη:          1 0.47 0.48 0.36 0.98 0.93 0.98 0.49 0.53 0.39 0.33 0.84 0.37 0.48 0.5 0.37
    Ic:          1 0.85 0.4 0.49 0.53 0.49 0.88 0.25 0.45 0.38 0.42 0.72 0.88 0.87 0.72
    Imc:         1 0.45 0.48 0.48 0.48 0.95 0.26 0.45 0.39 0.39 0.75 0.8 0.94 0.75
    Ip:          1 0.36 0.39 0.36 0.43 0.25 0.32 0.43 0.5 0.64 0.42 0.41 0.64
    Ihs:         1 0.95 0.99 0.51 0.52 0.4 0.32 0.85 0.38 0.5 0.52 0.38
    I_dalal^Σ:   1 0.95 0.51 0.53 0.4 0.34 0.89 0.42 0.54 0.5 0.42
    I_dalal^max: 1 0.5 0.52 0.4 0.32 0.85 0.38 0.5 0.52 0.38
    I_dalal^hit: 1 0.26 0.46 0.4 0.41 0.77 0.85 0.98 0.77
    IDf:         1 0.53 0.19 0.56 0.29 0.29 0.26 0.29
    Imv:         1 0.25 0.39 0.41 0.43 0.46 0.41
    Inc:         1 0.39 0.47 0.4 0.39 0.47
    Imcsc:       1 0.53 0.44 0.4 0.53
    ICSP:        1 0.71 0.76 0.99
    Iforget:     1 0.82 0.71
    ICC:         1 0.76
    Iis:         1

Let ⟦A⟧ be the indicator function, defined as ⟦A⟧ = 1 iff A is true and
⟦A⟧ = 0 otherwise.

Definition 3. Let K be a set of knowledge bases and I1, I2 be two inconsistency
measures. The correlation coefficient C_K(I1, I2) of I1 and I2 wrt. K is defined
via

    C_K(I1, I2) = ( Σ_{K,K′∈K, K≠K′} ⟦I1 ∼_{K,K′} I2⟧ ) / ( |K|(|K| − 1) )

In other words, C_K(I1, I2) gives the ratio of how much I1 and I2 agree on
the inconsistency order of any pair of knowledge bases from K.⁴ Observe that
C_K(I1, I2) = C_K(I2, I1).

⁴ Note that C_K is equivalent to the Kendall's tau coefficient [8] but scaled onto [0, 1].
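Definition 3, together with the order-compatibility relation above, can be sketched as follows, assuming the two measures' values are given as parallel lists indexed by knowledge base:

```python
def order_compatible(x1, y1, x2, y2):
    # I1 ~_{K,K'} I2: both measures order (or tie) the pair the same way;
    # (x1, y1) are I1's values on K and K', and (x2, y2) are I2's.
    return ((x1 > y1 and x2 > y2) or
            (x1 < y1 and x2 < y2) or
            (x1 == y1 and x2 == y2))

def correlation(vals1, vals2):
    # C_K(I1, I2): fraction of ordered pairs (K, K') with K != K'
    # on which the two measures are order-compatible.
    n = len(vals1)
    agree = sum(order_compatible(vals1[i], vals1[j], vals2[i], vals2[j])
                for i in range(n) for j in range(n) if i != j)
    return agree / (n * (n - 1))
```

As noted in the text, the coefficient is symmetric in its two arguments.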

4.3 Results
Tables 1 and 2 show the results of analysing the considered measures on K̂⊥ wrt.
the two evaluation measures from before⁵.

Regarding A1, it can be seen that Id has minimal entropy (by definition).
However, the measures I_dalal^hit and ICC, and to some extent most of the other
measures, are also quite indifferent in assigning their values. For example, out
of 61086 inconsistent knowledge bases, ICC assigns to 58523 of them the same
value 1. On the other hand, the measure IDf has maximal entropy among the
considered measures.
Regarding A2, we can observe some surprising correlations between measures,
even between those based on different concepts. For example, we have
C_{K̂⊥}(I_dalal^max, Ihs) ≈ 0.99, indicating a high correlation between I_dalal^max
and Ihs, although I_dalal^max is defined using distances and Ihs is defined using
hitting sets. Equally high correlations can be observed between the three measures
IMI, ICSP, and Iis. Further high correlations (e. g., above 0.8) can be observed
between many other measures. On the other hand, the measure IDf has (on average)
the smallest correlation to all other measures, backing up the observation from
before.

5 Conclusion
Our experimental analysis showed that many existing measures have low entropy
on the distribution of inconsistency values and correlate significantly in their
ranking of inconsistent knowledge bases. A web application for trying out all
the discussed inconsistency measures can be found on the website of the
TweetyProject⁶, cf. [10]. Most of these measures have been implemented using naive
algorithms, and research on the algorithmic issues of inconsistency measurement
is still desirable future work; see also [13].

Acknowledgements. The research reported here was partially supported by the
Deutsche Forschungsgemeinschaft (grant DE 1983/9-1).

References
1. Ammoura, M., Raddaoui, B., Salhi, Y., Oukacha, B.: On measuring inconsistency
using maximal consistent sets. In: Destercke, S., Denoeux, T. (eds.) ECSQARU
2015. LNCS (LNAI), vol. 9161, pp. 267–276. Springer, Cham (2015). https://doi.
org/10.1007/978-3-319-20807-7 24
2. Besnard, P.: Revisiting postulates for inconsistency measures. In: Fermé, E., Leite,
J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 383–396. Springer, Cham
(2014). https://doi.org/10.1007/978-3-319-11558-0 27
3. Grant, J., Hunter, A.: Measuring inconsistency in knowledge bases. J. Intell. Inf.
Syst. 27, 159–184 (2006)
⁵ We only considered the inconsistent knowledge bases from K̂ as all measures assign
degree 0 to the consistent ones anyway.
⁶ http://tweetyproject.org/w/incmes/.

4. Grant, J., Hunter, A.: Distance-based measures of inconsistency. In: van der Gaag,
L.C. (ed.) ECSQARU 2013. LNCS (LNAI), vol. 7958, pp. 230–241. Springer,
Heidelberg (2013). https://doi.org/10.1007/978-3-642-39091-3 20
5. Hunter, A., Konieczny, S.: Approaches to measuring inconsistent information. In:
Bertossi, L., Hunter, A., Schaub, T. (eds.) Inconsistency Tolerance. LNCS, vol.
3300, pp. 191–236. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-
540-30597-2 7
6. Jabbour, S., Ma, Y., Raddaoui, B.: Inconsistency measurement thanks to MUS
decomposition. In: Scerri, L., Huhns, B. (eds.) Proceedings of the 13th International
Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), pp.
877–884 (2014)
7. Jabbour, S., Ma, Y., Raddaoui, B., Sais, L., Salhi, Y.: On structure-based inconsis-
tency measures and their computations via closed set packing. In: Proceedings of
the 14th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2015), pp. 1749–1750 (2015)
8. Kendall, M.: A new measure of rank correlation. Biometrika 30(1–2), 81–89 (1938)
9. Mu, K., Liu, W., Jin, Z., Bell, D.: A syntax-based approach to measuring the degree
of inconsistency for belief bases. Int. J. Approximate Reasoning 52(7), 978–999
(2011)
10. Thimm, M.: Tweety - a comprehensive collection of Java libraries for logical
aspects of artificial intelligence and knowledge representation. In: Proceedings of
the 14th International Conference on Principles of Knowledge Representation and
Reasoning (KR 2014), pp. 528–537, July 2014
11. Thimm, M.: On the expressivity of inconsistency measures. Artif. Intell. 234, 120–
151 (2016)
12. Thimm, M.: On the compliance of rationality postulates for inconsistency mea-
sures: a more or less complete picture. Künstliche Intell. 31(1), 31–39 (2017)
13. Thimm, M., Wallner, J.P.: Some complexity results on inconsistency measurement.
In: Proceedings of the 15th International Conference on Principles of Knowledge
Representation and Reasoning (KR 2016), pp. 114–123, April 2016
Inconsistency Measurement

Matthias Thimm(B)

University of Koblenz-Landau, Koblenz, Germany


[email protected]

Abstract. The field of Inconsistency Measurement is concerned with
the development of principles and approaches to quantitatively assess
the severity of inconsistency in knowledge bases. In this survey, we give
a broad overview of this field by outlining its basic motivation and
discussing some of its core principles and approaches. We focus on the
work that has been done for classical propositional logic but also give
some pointers to applications in other logical formalisms.

1 Introduction
Inconsistency is a ubiquitous phenomenon whenever knowledge1 is compiled in
some formal language. The notion of inconsistency refers (usually) to multiple
pieces of information and represents a conflict between those, i. e., they cannot
hold at the same time. The two statements “It is sunny outside” and “It is not
sunny outside” represent inconsistent information and in order to draw meaning-
ful conclusions from a knowledge base containing these statements, this conflict
has to be resolved somehow. In applications such as decision-support systems,
a knowledge base is usually compiled by merging the formalised knowledge of
many different experts. It is unavoidable that different experts contradict each
other and that the merged knowledge base becomes inconsistent. The field of
Knowledge Representation and Reasoning (KR) [7] is the subfield of Artificial
Intelligence (AI) that deals with the issues of logical formalisations of informa-
tion and the modelling of rational reasoning behaviour, in particular in light
of inconsistent or uncertain information. One paradigm to deal with inconsis-
tent information is to abandon classical inference and define new ways of rea-
soning. Some examples of such formalisms are, e. g., paraconsistent logics [6],
default logic [34], answer set programming [15], and, more recently, computa-
tional models of argumentation [1]. Moreover, the fields of belief revision [21] and
belief merging [10,28] deal with the particular case of inconsistencies in dynamic
settings.
The field of Inconsistency Measurement—see the seminal work [20] and the
recent book [19]—provides an analytical perspective on the issue of inconsis-
tency. Its aim is to quantitatively assess the severity of inconsistency in order
1 We use the term knowledge to refer to subjective knowledge or beliefs, i. e., pieces of
information that may not necessarily be true in the real world but are only assumed
to be true for the agent(s) under consideration.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 9–23, 2019.
https://doi.org/10.1007/978-3-030-35514-2_2
to both guide automatic reasoning mechanisms and to help human modellers in
identifying issues and comparing different alternative formalisations. Consider the
following two knowledge bases K1 and K2 formalised in classical propositional
logic (see Sect. 2 for the formal background) modelling some information about
the weather:

K1 = {sunny, ¬sunny, hot, ¬hot}
K2 = {¬hot, sunny, sunny → hot, humid}

Both K1 and K2 are classically inconsistent, i. e., there is no interpretation
satisfying either of them. But looking closer into the structure of the knowledge bases
one can identify differences in the severity of the inconsistency. In K1 there are
two “obvious” contradictions, i. e., {sunny, ¬sunny} and {hot, ¬hot} are directly
conflicting formulas. In K2 , the conflict is a bit more hidden. Here, three for-
mulas are necessary to produce a contradiction ({¬hot, sunny, sunny → hot}).
Moreover, there is one formula in K2 (humid), which is not participating in any
conflict and one could still infer meaningful information from this by relying on
e. g. paraconsistent reasoning techniques [6]. In conclusion, one should regard
K1 as more inconsistent than K2 . So a decision-maker should prefer using K2
instead of K1 .
The analysis of the severity of inconsistency in the knowledge bases K1 and
K2 above was informal. Formal accounts to the problem of assessing the severity
of inconsistency are given by inconsistency measures and there have been a lot
of proposals of those in recent years. Up to today, the concept of severity of
inconsistency has not been axiomatised in a satisfactory manner and the series
of different inconsistency measures approach this challenge from different points
of view and focus on different aspects on what constitutes severity. Consider the
next two knowledge bases (with abstract propositions a and b)

K3 = {a, ¬a, b} K4 = {a ∨ b, ¬a ∨ b, a ∨ ¬b, ¬a ∨ ¬b}

Again both K3 and K4 are inconsistent, but which one is more inconsistent
than the other? Our reasoning from above cannot be applied here in the same
fashion. The knowledge base K3 contains an apparent contradiction ({a, ¬a})
but also a formula not participating in the inconsistency ({b}). The knowledge
base K4 contains a “hidden” conflict as four formulas are necessary to produce a
contradiction, but all formulas of K4 are participating in this. In this case, it is
not clear how to assess the inconsistency of these knowledge bases and different
measures may order these knowledge bases differently. More generally speaking,
it is not universally agreed upon which so-called rationality postulates should
be satisfied by a reasonable account of inconsistency measurement, see [3,5,41]
for a discussion. Besides concrete approaches to inconsistency measurement the
community has also proposed a series of those rationality postulates in order
to describe general desirable behaviour and the classification of inconsistency
measures by the postulates they satisfy is still one of the most important ways to
evaluate the quality of a measure, even if the set of desirable postulates is not
universally accepted. For example, one of the most popular rationality postulates

is monotony, which states that for any K ⊆ K′, the knowledge base K cannot
be regarded as more inconsistent than K′. The justification for this demand is
that inconsistency cannot be resolved by adding new information but only
increased2. While this is usually regarded as a reasonable demand, there are also
situations where monotony may be seen as counterintuitive, even in monotonic
logics. Consider the next two knowledge bases

K5 = {a, ¬a} K6 = {a, ¬a, b1 , . . . , b998 }

We have K5 ⊆ K6 and, following monotony, K6 should be regarded as at least as
inconsistent as K5. However, when judging the content of the knowledge bases
“relatively”, K5 may seem more inconsistent: K5 contains no useful information
and all formulas of K5 are in conflict with another formula. In K6 , however, only
2 out of 1000 formulas are participating in the contradiction. So it may also be
a reasonable point of view to judge K5 more inconsistent than K6 .
In this survey paper, we give a brief overview on formal accounts to inconsis-
tency measurement. We focus on approaches building on classical propositional
logic but also briefly discuss approaches for other formalisms. A more technical
survey of inconsistency measures can be found in [41] and the book [19] captures
the recent state-of-the-art as a whole. An older survey can also be found in [22].
The remainder of this paper is organised as follows. In Sect. 2 we give some
necessary technical preliminaries. Section 3 introduces the concept of inconsis-
tency measures formally and discusses rationality postulates. In Sect. 4 we dis-
cuss some of the most important concrete approaches to inconsistency mea-
surement for classical propositional logic and in Sect. 5 we give an overview on
approaches for other formalisms. Section 6 concludes.

2 Preliminaries

Let At be some fixed set of propositions and let L(At) be the corresponding
propositional language constructed using the usual connectives ∧ (conjunction),
∨ (disjunction), → (implication), and ¬ (negation).
Definition 1. A knowledge base K is a finite set of formulas K ⊆ L(At). Let K
be the set of all knowledge bases.
If X is a formula or a set of formulas we write At(X) to denote the set of
propositions appearing in X.
Semantics for a propositional language is given by interpretations where an
interpretation ω on At is a function ω : At → {true, false}. Let Ω(At) denote
the set of all interpretations for At. An interpretation ω satisfies (or is a model
of) a proposition a ∈ At, denoted by ω |= a, if and only if ω(a) = true. The
satisfaction relation |= is extended to formulas in the usual way.

2 At least in monotonic logics; for a discussion about inconsistency measurement in
non-monotonic logics see [9,43] and Sect. 5.3.

For Φ ⊆ L(At) we also define ω |= Φ if and only if ω |= φ for every φ ∈ Φ.
A formula or set of formulas X1 entails another formula or set of formulas X2,
denoted by X1 |= X2, if and only if ω |= X1 implies ω |= X2. If there is no ω
with ω |= X we also write X |= ⊥ and say that X is inconsistent.
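These semantics can be made concrete with a small sketch. The code below is illustrative only (all names are invented here, and formulas are modelled as Python predicates over interpretations rather than as syntax trees); it checks satisfiability by enumerating Ω(At):

```python
from itertools import product

def interpretations(atoms):
    """Enumerate Omega(At): all truth assignments over the given atoms."""
    for values in product([True, False], repeat=len(atoms)):
        yield dict(zip(atoms, values))

def satisfies(omega, kb):
    """omega |= K iff omega is a model of every formula in K."""
    return all(phi(omega) for phi in kb)

def is_consistent(kb, atoms):
    """K is consistent iff at least one interpretation satisfies it."""
    return any(satisfies(omega, kb) for omega in interpretations(atoms))

# The knowledge base K1 = {sunny, ¬sunny, hot, ¬hot} from the introduction
# has no model:
atoms = ["sunny", "hot"]
K1 = [lambda w: w["sunny"], lambda w: not w["sunny"],
      lambda w: w["hot"], lambda w: not w["hot"]]
print(is_consistent(K1, atoms))  # False
```

The exhaustive check is exponential in |At| and is only meant to make the definitions tangible for small signatures.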

3 Measuring Inconsistency

Let R∞≥0 be the set of non-negative real values including infinity. The most general
form of an inconsistency measure is as follows.

Definition 2. An inconsistency measure I is any function I : K → R∞≥0.

The above definition is, of course, under-constrained for the purpose of providing
a quantitative means to measure inconsistency. The intuition behind any
concrete inconsistency measure I is that a larger value I(K) for a knowledge
base K indicates more severe inconsistency in K than a lower value. Moreover,
we wish to reserve the minimal value (0) to indicate the complete absence of
inconsistency. This is captured by the following
postulate [23]:
Consistency I(K) = 0 iff K is consistent.

Satisfaction of the consistency postulate is a basic demand for any reasonable
inconsistency measure and is satisfied by all known concrete approaches [39,41].
Beyond the consistency postulate, a series of further postulates has been
proposed in the literature [41]. We only recall the basic ones initially proposed
in [23]. In order to state these postulates we need two further definitions.
Definition 3. A set M ⊆ K is a minimal inconsistent subset of K iff M |= ⊥
and there is no M′ ⊊ M with M′ |= ⊥. Let MI(K) be the set of all minimal
inconsistent subsets of K.

Definition 4. A formula α ∈ K is called a free formula if α ∉ ⋃MI(K). Let
Free(K) be the set of all free formulas of K.

In other words, a minimal inconsistent subset characterises a minimal conflict in
a knowledge base and a free formula is a formula that is not directly participating
in any derivation of a contradiction. Let I be any function I : K → R∞≥0, let
K, K′ ∈ K, and α, β ∈ L(At). The remaining rationality postulates from [23] are:

Normalisation 0 ≤ I(K) ≤ 1.
Monotony If K ⊆ K′ then I(K) ≤ I(K′).
Free-formula independence If α ∈ Free(K) then I(K) = I(K \ {α}).
Dominance If α ⊭ ⊥ and α |= β then I(K ∪ {α}) ≥ I(K ∪ {β}).

The postulate normalisation states that the inconsistency value is always in
the unit interval, thus allowing inconsistency values to be comparable even if
knowledge bases are of different sizes. Monotony requires that adding formulas
to the knowledge base cannot decrease the inconsistency value. Free-formula
independence states that removing free formulas from the knowledge base should
not change the inconsistency value. The motivation here is that free formulas do
not participate in inconsistencies and should not contribute to having a certain
inconsistency value. Dominance says that substituting a consistent formula α by
a weaker formula β should not increase the inconsistency value. Here, as β carries
less information than α there should be less opportunities for inconsistencies to
occur.
The five postulates from above are independent (no single postulate entails
another one) and compatible (as, e. g., the drastic measure Id, see below, satisfies
all of them). However, they do not characterise a single concrete approach but
leave ample room for various different approaches. Moreover, for all rationality
postulates (except consistency) there is at least one inconsistency measure in
the literature that does not satisfy it [41] and there is no general agreement on
whether these postulates are indeed desirable at all [3,5,41]. We already gave
an example why monotony may not be desirable in the introduction. Here is
another example for free-formula independence taken from [3].
Example 1. Consider the knowledge base K7 defined via

K7 = {a ∧ c, b ∧ ¬c, ¬a ∨ ¬b}

Notice that K7 has a single minimal inconsistent subset {a ∧ c, b ∧ ¬c} and
that ¬a ∨ ¬b is a free formula. If I satisfies free-formula independence we have
I(K7) = I(K7 \ {¬a ∨ ¬b}). However, ¬a ∨ ¬b adds another "conflict" about the truth of
propositions a and b.
We will continue the discussion on rationality postulates later in Sect. 6. But
first we will have a look at some concrete approaches.

4 Approaches

There is a wide variety of inconsistency measures in the literature, the work [41]
alone lists 22 measures in 2018 and more have been proposed since then3 . In this
paper we consider only a few to illustrate the main concepts.
The measure Id is usually referred to as a baseline for inconsistency measures
as it only distinguishes between consistent and inconsistent knowledge bases.

3 Implementations of most of these measures can also be found in the Tweety
Libraries for Artificial Intelligence [40] and an online interface is available at
http://tweetyproject.org/w/incmes.

Definition 5 ([24]). The drastic inconsistency measure Id : K → R∞≥0 is
defined as

    Id(K) = 1 if K |= ⊥, and Id(K) = 0 otherwise,

for K ∈ K.
While not being particularly useful for the purpose of actually differentiating
between inconsistent knowledge bases, the measure Id already satisfies the basic
five postulates from above [24].
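As a sketch (with illustrative names, formulas again modelled as predicates over assignments, and satisfiability checked by brute force), Id amounts to a single consistency test:

```python
from itertools import product

def is_consistent(kb, atoms):
    """Brute-force satisfiability: try every assignment to the atoms."""
    return any(all(phi(dict(zip(atoms, vals))) for phi in kb)
               for vals in product([True, False], repeat=len(atoms)))

def drastic(kb, atoms):
    """Id: 1 for inconsistent knowledge bases, 0 for consistent ones."""
    return 0 if is_consistent(kb, atoms) else 1

print(drastic([lambda w: w["a"], lambda w: not w["a"]], ["a"]))  # 1
print(drastic([lambda w: w["a"]], ["a"]))                        # 0
```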
In [22] several dimensions for measuring inconsistency have been discussed.
A particular observation from this discussion is that inconsistency measures
can be roughly divided into two categories: syntactic and semantic approaches.
While this distinction is not clearly defined4 it has been used in following works
to classify many inconsistency measures. Using this categorisation, syntactic
approaches refer to inconsistency measures that make use of syntactic objects
such as minimal inconsistent sets (or maximal consistent sets). On the other
hand, semantic approaches refer to measures employing non-classical semantics
for that purpose. However, there are further measures which fall into neither (or
both) categories. In the following, we will look at some measures from each of
these categories.

4.1 Measures Based on Minimal Inconsistent Sets


A minimal inconsistent subset M of a knowledge base K represents the “essence”
of a single conflict in K. Naturally, a simple approach to measure inconsistency
is to take the number of minimal inconsistent subsets as a measure.
Definition 6 ([24]). The MI-inconsistency measure IMI : K → R∞≥0 is defined
as IMI(K) = |MI(K)| for K ∈ K.
The above measure complies with the postulates of consistency, monotony,
and free-formula independence but fails to satisfy dominance and normalisation
(although a normalised variant that suffers from other shortcomings can easily
be defined). Table 2 below gives an overview on the compliance of the measures
formally considered in this paper with the basic postulates from above, see [41]
for proofs or references to proofs. The idea behind the MI-inconsistency measure
can be refined in several ways, taking e. g. the sizes of the individual minimal
inconsistent sets and how they overlap into account [13,25,26]. One example
being the following measure.
Definition 7 ([24]). The MIC-inconsistency measure IMIC : K → R∞≥0 is
defined as

    IMIC(K) = Σ_{M ∈ MI(K)} 1/|M|

for K ∈ K.
4 And in this author's opinion also a bit mislabelled.

The MIC-inconsistency measure also takes the sizes of the individual minimal
inconsistent subsets into account. The intuition here is that larger minimal incon-
sistent subsets represent less inconsistency (as the conflict is more “hidden”) and
small minimal inconsistent subsets represent more inconsistency (as it is more
“apparent”).

Example 2. Consider again knowledge bases K1 and K2 from before defined via

K1 = {sunny, ¬sunny, hot, ¬hot}
K2 = {¬hot, sunny, sunny → hot, humid}

Here we have

IMI(K1) = 2      IMI(K2) = 1
IMIC(K1) = 1     IMIC(K2) = 1/3

Observe that, while IMI and IMIC disagree on the exact values of the inconsistency
in K1 and K2 they do agree on their order (K1 is more inconsistent than K2 ).
This is not generally true, consider

K8 = {a, ¬a}
K9 = {a1, ¬a1 ∨ b1, ¬b1 ∨ c1, ¬c1 ∨ d1, ¬d1 ∨ ¬a1,
a2, ¬a2 ∨ b2, ¬b2 ∨ c2, ¬c2 ∨ d2, ¬d2 ∨ ¬a2}

IMI(K8) = 1       IMI(K9) = 2
IMIC(K8) = 1/2    IMIC(K9) = 2/5

where K8 is less inconsistent than K9 according to IMI and the other way around
for IMIC.
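For small knowledge bases, MI(K), and hence IMI and IMIC, can be computed naively by enumerating subsets. The sketch below (illustrative names; formulas as Python predicates; satisfiability by brute force) reproduces the values for K1 and K2 from Example 2:

```python
from itertools import product, combinations
from fractions import Fraction

def is_consistent(kb, atoms):
    """Brute-force satisfiability over all assignments to the atoms."""
    return any(all(phi(dict(zip(atoms, vals))) for phi in kb)
               for vals in product([True, False], repeat=len(atoms)))

def mi_sets(kb, atoms):
    """MI(K): inconsistent subsets all of whose one-element-smaller subsets
    are consistent (this implies minimality, since any subset of a
    consistent set is consistent)."""
    return [sub for r in range(1, len(kb) + 1)
            for sub in combinations(kb, r)
            if not is_consistent(sub, atoms)
            and all(is_consistent([f for f in sub if f is not g], atoms)
                    for g in sub)]

def i_mi(kb, atoms):
    """I_MI: the number of minimal inconsistent subsets."""
    return len(mi_sets(kb, atoms))

def i_mic(kb, atoms):
    """I_MIC: the sum of 1/|M| over all M in MI(K)."""
    return sum(Fraction(1, len(M)) for M in mi_sets(kb, atoms))

atoms = ["sunny", "hot", "humid"]
K1 = [lambda w: w["sunny"], lambda w: not w["sunny"],
      lambda w: w["hot"], lambda w: not w["hot"]]
K2 = [lambda w: not w["hot"], lambda w: w["sunny"],
      lambda w: (not w["sunny"]) or w["hot"],  # sunny -> hot
      lambda w: w["humid"]]
print(i_mi(K1, atoms), i_mic(K1, atoms))  # 2 1
print(i_mi(K2, atoms), i_mic(K2, atoms))  # 1 1/3
```

Enumerating all subsets is exponential, which is in line with the complexity remarks in Sect. 6; the sketch is only meant as an executable reading of the definitions.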

4.2 Measures Based on Non-classical Semantics

Measures based on minimal inconsistent subsets provide a formula-centric view
on the matter of inconsistency [22]. If a formula (as a whole) is part of a conflict, it
is taken into account for measuring inconsistency. Another possibility is to focus
on propositions rather than formulas. Consider again the knowledge base K7 =
{a ∧ c, b ∧ ¬c, ¬a ∨ ¬b} from Example 1 which possesses one minimal inconsistent
subset {a ∧ c, b ∧ ¬c}. However, it is clear that there is also a conflict involving
the propositions a and b, which is not “detected” by measures based on minimal
inconsistent subsets. Thus, another angle for measuring inconsistency consists
in counting how many propositions participate in the inconsistency. A possible
means for doing this is by relying on non-classical semantics. The contension
measure [17] makes use of Priest’s logic of paradox, which has a paraconsistent
semantics that we briefly recall now. A three-valued interpretation υ on At is a
function υ : At → {T, F, B} where the values T and F correspond to the classical

true and false, respectively. The additional truth value B stands for both and is
meant to represent a conflicting truth value for a proposition. The function υ is
extended to arbitrary formulas as shown in Table 1. An interpretation υ satisfies
a formula α, denoted by υ |=3 α if either υ(α) = T or υ(α) = B. Define υ |=3 K
for a knowledge base K accordingly. Now inconsistency can be measured by
seeking an interpretation υ that assigns B to a minimal number of propositions.

Definition 8 ([17]). The contension inconsistency measure Ic : K → R∞≥0 is
defined as

    Ic(K) = min{|υ−1(B) ∩ At| | υ |=3 K}

for K ∈ K.

Note that Ic is well-defined as for every knowledge base K there is always at
least one interpretation υ satisfying it, e. g., the interpretation that assigns B to all
propositions.

Table 1. Truth tables for propositional three-valued logic.

α β | υ(α ∧ β) | υ(α ∨ β)        α | υ(¬α)
T T |    T     |    T            T |   F
T B |    B     |    T            B |   B
T F |    F     |    T            F |   T
B T |    B     |    T
B B |    B     |    B
B F |    F     |    B
F T |    F     |    T
F B |    F     |    B
F F |    F     |    F
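These truth tables can be encoded numerically: in the truth order F < B < T, ∧ and ∨ act as minimum and maximum, and ¬ swaps T and F while fixing B. A brute-force sketch of Ic under these assumptions (illustrative names; → rendered as material implication ¬α ∨ β):

```python
from itertools import product

F, B, T = 0, 1, 2  # truth order of the three-valued logic: F < B < T

def neg(x): return 2 - x       # ¬ swaps T and F, maps B to B
def conj(*xs): return min(xs)  # ∧ is minimum in the truth order
def disj(*xs): return max(xs)  # ∨ is maximum in the truth order

def contension(kb, atoms):
    """Ic(K): fewest atoms set to B by any three-valued model of K."""
    best = None
    for vals in product([F, B, T], repeat=len(atoms)):
        v = dict(zip(atoms, vals))
        if all(phi(v) >= B for phi in kb):  # v |=3 K: every formula T or B
            count = vals.count(B)
            best = count if best is None else min(best, count)
    return best

atoms = ["sunny", "hot", "humid"]
K1 = [lambda v: v["sunny"], lambda v: neg(v["sunny"]),
      lambda v: v["hot"], lambda v: neg(v["hot"])]
K2 = [lambda v: neg(v["hot"]), lambda v: v["sunny"],
      lambda v: disj(neg(v["sunny"]), v["hot"]),  # sunny -> hot
      lambda v: v["humid"]]
print(contension(K1, atoms), contension(K2, atoms))  # 2 1
```

These values anticipate Example 3: in K1 both sunny and hot must be assigned B, whereas in K2 it suffices to assign B to hot.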

A further approach—that is, in contrast to Ic, still formula-centric—is to make
use of probability logic to define an inconsistency measure [27]. A probability
function P on L(At) is a function P : Ω(At) → [0, 1] with Σ_{ω∈Ω(At)} P(ω) = 1.
We extend P to assign a probability to any formula φ ∈ L(At) by defining

    P(φ) = Σ_{ω|=φ} P(ω)

Let P(At) be the set of all those probability functions.


Definition 9 ([27]). The η-inconsistency measure Iη : K → R∞≥0 is defined as

Iη (K) = 1 − max{ξ | ∃P ∈ P(At) : ∀α ∈ K : P (α) ≥ ξ}

for K ∈ K.

The measure Iη looks for a probability function P that maximises the minimum
probability of all formulas in K. The larger this probability the less inconsistent
K is assessed (if there is a probability function assigning 1 to all formulas then
K is obviously consistent).

Example 3. Consider again knowledge bases K1 and K2 from before defined via

K1 = {sunny, ¬sunny, hot, ¬hot}
K2 = {¬hot, sunny, sunny → hot, humid}

Here we have

Ic (K1 ) = 2 Ic (K2 ) = 1
Iη (K1 ) = 0.5 Iη (K2 ) = 1/3

where, in particular, Ic also agrees with IMI (see Example 2). Consider now

K10 = {a, ¬a} K11 = {a ∧ b ∧ c, ¬a ∧ ¬b ∧ ¬c}

where

Ic(K10) = 1      Ic(K11) = 3
Iη(K10) = 0.5    Iη(K11) = 0.5
IMI(K10) = 1     IMI(K11) = 1

So Ic looks inside formulas to determine the severity of inconsistency.

While Ic makes use of paraconsistent logic and Iη of probability logic, other
logics can be used for that purpose as well. In [38] a general framework is
established that allows plugging in any many-valued logic (such as fuzzy logic)
to define inconsistency measures.

4.3 Further Measures

There are further ways to define inconsistency measures that do not fall strictly
in one of the two paradigms above. We have a look at some now.
A simple approach to obtain a more proposition-centric measure (like Ic) while
still relying on minimal inconsistent sets is the following measure.

Definition 10 ([44]). The mv inconsistency measure Imv : K → R∞≥0 is
defined as

    Imv(K) = |⋃_{M ∈ MI(K)} At(M)| / |At(K)|

for K ∈ K.

In other words, Imv (K) is the ratio of the number of propositions that appear
in at least one minimal inconsistent set and the number of all propositions.
Another approach that makes no use of either minimal inconsistent sets or
non-classical semantics is the following one. A subset H ⊆ Ω(At) is called a
hitting set of K if for every φ ∈ K there is ω ∈ H with ω |= φ.
Definition 11 ([37]). The hitting-set inconsistency measure Ihs : K → R∞≥0 is
defined as

Ihs (K) = min{|H| | H is a hitting set of K} − 1

for K ∈ K with min ∅ = ∞.


So Ihs seeks a minimal number of (classical) interpretations such that for each
formula there is at least one model in this set.
Example 4. Consider again knowledge bases K1 and K2 from before defined via

K1 = {sunny, ¬sunny, hot, ¬hot}
K2 = {¬hot, sunny, sunny → hot, humid}

Here we have

Imv(K1) = 1      Imv(K2) = 2/3
Ihs(K1) = 1      Ihs(K2) = 1
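Both measures of this subsection can also be brute-forced on small examples. In the sketch below (illustrative names), each formula carries the set of atoms it mentions so that At(M) can be read off syntactically, and Ihs tries candidate sets of interpretations of increasing size; the values match Example 4:

```python
from itertools import product, combinations
from fractions import Fraction

def interpretations(atoms):
    return [dict(zip(atoms, vals))
            for vals in product([True, False], repeat=len(atoms))]

def consistent(fs, atoms):
    """Formulas are (predicate, atoms-mentioned) pairs."""
    return any(all(p(w) for p, _ in fs) for w in interpretations(atoms))

def mi_sets(kb, atoms):
    """Minimal inconsistent subsets, by exhaustive search."""
    return [sub for r in range(1, len(kb) + 1)
            for sub in combinations(kb, r)
            if not consistent(sub, atoms)
            and all(consistent([f for f in sub if f is not g], atoms)
                    for g in sub)]

def i_mv(kb, atoms):
    """Imv: fraction of atoms occurring in some minimal inconsistent subset."""
    touched = set()
    for M in mi_sets(kb, atoms):
        for _, ats in M:
            touched |= ats
    return Fraction(len(touched), len(atoms))

def i_hs(kb, atoms):
    """Ihs: size of a smallest hitting set of interpretations, minus one."""
    ws = interpretations(atoms)
    for k in range(1, len(ws) + 1):
        for H in combinations(ws, k):
            if all(any(p(w) for w in H) for p, _ in kb):
                return k - 1
    return float("inf")  # min ∅ = ∞: some formula has no model at all

atoms1 = ["sunny", "hot"]
K1 = [(lambda w: w["sunny"], {"sunny"}),
      (lambda w: not w["sunny"], {"sunny"}),
      (lambda w: w["hot"], {"hot"}),
      (lambda w: not w["hot"], {"hot"})]
atoms = ["sunny", "hot", "humid"]
K2 = [(lambda w: not w["hot"], {"hot"}),
      (lambda w: w["sunny"], {"sunny"}),
      (lambda w: (not w["sunny"]) or w["hot"], {"sunny", "hot"}),
      (lambda w: w["humid"], {"humid"})]
print(i_mv(K1, atoms1), i_hs(K1, atoms1))  # 1 1
print(i_mv(K2, atoms), i_hs(K2, atoms))    # 2/3 1
```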

Moreover, Grant and Hunter [18] define new families of inconsistency mea-
sures based on distances of classical interpretations to being models of a knowl-
edge base. Besnard [4] counts how many propositions have to be forgotten—i. e.
removed from the underlying signature of the knowledge base—to turn an incon-
sistent knowledge base into a consistent one.

Table 2. Compliance of inconsistency measures with rationality postulates consistency
(CO), normalisation (NO), monotony (MO), free-formula independence (IN), and
dominance (DO)

I CO NO MO IN DO
Id ✓ ✓ ✓ ✓ ✓
IMI ✓ ✗ ✓ ✓ ✗
IMIC ✓ ✗ ✓ ✓ ✗
Ic ✓ ✗ ✓ ✓ ✓
Iη ✓ ✓ ✓ ✓ ✓
Imv ✓ ✓ ✗ ✗ ✗
Ihs ✓ ✗ ✓ ✓ ✓

5 Beyond Propositional Logic


While most work in the field of inconsistency measurement is concerned with
using propositional logic as the knowledge representation formalism, there are
a few works which consider measuring inconsistency in other logics. We will
have a brief overview of some of these works now; see [19] for some others.

5.1 First-Order and Description Logic

In [16], first-order logic is considered as the base logic. Allowing for objects and
quantification brings new challenges to measuring inconsistency as one should
distinguish in a more fine-grained manner how much certain formulas contribute
to inconsistency. For example, a formula ∀X : bird(X) → f lies(X)—which mod-
els that all birds fly—is probably the culprit of some inconsistency in any knowl-
edge base. However, depending on how many objects actually satisfy/violate
the implication, the severity of the inconsistency of the overall knowledge base
may differ (compare having a knowledge base with 10 flying birds and 1 non-
flying bird to a knowledge base with 1000 flying birds and 1 non-flying bird).
The authors of [16] address this challenge by proposing some new inconsistency measures for
first-order logic.
There are also several works—see e. g. [29,45]—that deal with measuring
inconsistency in ontologies formalised in certain description logics.

5.2 Probabilistic Logic

In probabilistic logic, classical propositional formulas are augmented by
probabilities, yielding statements such as (sunny ∧ humid)[0.7] meaning "it will be
sunny and humid with probability 0.7”. Semantics are given to such a logic by
means of probability distributions over sets of propositions. Inconsistencies in
modelling with such a logic can appear, in particular, when “the numbers do
not add up”. In addition to the previous formula consider (humid)[0.5] which
states that “it will be humid with probability 0.5”. Both formulas together are
inconsistent as it cannot be the case that the probability of being humid is at least
0.7 (implied by the first formula, since P(humid) ≥ P(sunny ∧ humid)) and 0.5 at the same time. Measures
for probabilistic logic, see the recent survey [12], focus on measuring distances of
the probabilities of the formulas to a consistent state or propose weaker notions
of satisfying probability distributions and measure distances between those and
classical probability distributions.

5.3 Non-monotonic Logics

In non-monotonic logics, inconsistency in a knowledge base may be resolved by
adding formulas. Consider, e. g., the following rules in answer set programming [8]:
{b ←, ¬b ← not a}. Informally, these rules state that b is the case and that if a is
not the case, ¬b is the case. The negation “not” is a negation-as-failure and the

whole program is inconsistent as both b and ¬b can be derived. However, adding
the rule a ←, stating that a is the case, makes the program consistent again as the
second rule is not applicable any more. An implication of this observation is that
second rule is not applicable any more. An implication of this observation is that
consistent programs may have inconsistent subsets, which make the application
of classical measures based on minimal inconsistent sets useless. In [9] a stronger
notion for minimal inconsistent sets for non-monotonic logics is proposed that
is used for inconsistency measurement in [43], and, in particular, for answer set
programming in [42].

6 Summary and Discussion


In this paper we gave a brief overview of the field of inconsistency measurement.
We motivated the field, discussed several rationality postulates for concrete mea-
sures, and surveyed some of its basic approaches. We also gave a short overview
on approaches that use formalisms other than propositional logic as the base
knowledge representation formalism.
Inconsistency measures can be used to compare different formalisations of
knowledge, to help debug flawed knowledge bases, and to guide automatic repair
methods. For example, inconsistency measures have been used to estimate reli-
ability of agents in multi-agent systems [11], to allow for inconsistency-tolerant
reasoning in probabilistic logic [33], or to monitor and maintain quality in
database settings [14].
Inconsistency measurement is a problem that is not easily defined in a formal
manner. Many approaches have been proposed, in particular in recent years, each
taking a different perspective on this issue. We discussed rationality postulates
as a means to prescribe general desirable behaviour of an inconsistency measure,
and there have also been a lot of proposals in the recent past: [41] lists an
additional 13 postulates compared to the five we discussed here. Many of them
are mutually exclusive, describe orthogonal requirements, and are not generally
accepted in the community. Besides rationality postulates, other dimensions for
comparing inconsistency measures are their expressivity and their computational
complexity. Expressivity [36,41] refers to the capability of an inconsistency measure
to differentiate between many inconsistent knowledge bases. For example, the drastic
inconsistency measure—which assigns 1 to every inconsistent knowledge base—
has minimal expressivity as it can only differentiate between consistency and
inconsistency. On the other hand, the contension measure Ic can differentiate
up to n + 1 different states of inconsistency, where n is the number of propo-
sitions appearing in the signature. As for computational complexity, it is clear
that all problems related to inconsistency measurement are coNP-hard, as the
identification of unsatisfiability is always part of the definition. In fact, the
problem of deciding whether a certain value is a lower bound for the actual
inconsistency value of a given inconsistency measure is coNP-complete for many
measures such as Ic [35,41]. However, the problem is harder for other measures,
e. g., the same problem for Imv is already Σ2p-complete [44].
This paper points to a series of open research questions that may be inter-
esting to pursue. For example, the discussion on the “right” set of postulates

is not over. What is needed is a characterising definition of an inconsistency
measure using a few postulates, just as entropy is characterised by a few simple
properties as an information measure. However, we are currently far away from
a complete understanding of what an inconsistency measure constitutes.
Moreover, the algorithmic aspects of inconsistency measurement have (almost) not
been investigated at all. Although straightforward prototype implementations of
most measures are available5, those implementations do not necessarily optimise
runtime performance. Only a few papers [2,30–32,37] have addressed this challenge
previously, mainly by developing approximation algorithms. Besides more work
on approximation algorithms, another avenue for future work is also to develop
algorithms that work effectively on certain language fragments—such as certain
description logics—and thus may work well in practical applications.

Acknowledgements. The research reported here was partially supported by the
Deutsche Forschungsgemeinschaft (grant DE 1983/9-1).

References
1. Baroni, P., Gabbay, D., Giacomin, M., van der Torre, L. (eds.): Handbook of Formal
Argumentation. College Publications, London (2018)
2. Bertossi, L.: Repair-based degrees of database inconsistency. In: Balduccini, M.,
Lierler, Y., Woltran, S. (eds.) LPNMR 2019. LNCS, vol. 11481, pp. 195–209.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20528-7_15
3. Besnard, P.: Revisiting postulates for inconsistency measures. In: Fermé, E., Leite,
J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 383–396. Springer, Cham
(2014). https://doi.org/10.1007/978-3-319-11558-0_27
4. Besnard, P.: Forgetting-based inconsistency measure. In: Schockaert, S., Senellart,
P. (eds.) SUM 2016. LNCS (LNAI), vol. 9858, pp. 331–337. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-45856-4_23
5. Besnard, P.: Basic postulates for inconsistency measures. In: Hameurlain, A.,
Küng, J., Wagner, R., Decker, H. (eds.) Transactions on Large-Scale Data- and
Knowledge-Centered Systems XXXIV. LNCS, vol. 10620, pp. 1–12. Springer,
Heidelberg (2017). https://doi.org/10.1007/978-3-662-55947-5_1
6. Béziau, J.-Y., Carnielli, W., Gabbay, D. (eds.): Handbook of Paraconsistency.
College Publications, London (2007)
7. Brachman, R.J., Levesque, H.J.: Knowledge Representation and Reasoning.
Morgan Kaufmann Publishers, Massachusetts (2004)
8. Brewka, G., Eiter, T., Truszczynski, M.: Answer set programming at a glance.
Commun. ACM 54(12), 92–103 (2011)
9. Brewka, G., Thimm, M., Ulbricht, M.: Strong inconsistency. Artif. Intell. 267,
78–117 (2019)
10. Cholvy, L., Hunter, A.: Information fusion in logic: a brief overview. In: Gabbay,
D.M., Kruse, R., Nonnengart, A., Ohlbach, H.J. (eds.) ECSQARU/FAPR-1997.
LNCS, vol. 1244, pp. 86–95. Springer, Heidelberg (1997). https://doi.org/10.1007/
BFb0035614
11. Cholvy, L., Perrussel, L., Thevenin, J.M.: Using inconsistency measures for esti-
mating reliability. Int. J. Approximate Reasoning 89, 41–57 (2017)
5 See http://tweetyproject.org/w/incmes.
22 M. Thimm

12. De Bona, G., Finger, M., Potyka, N., Thimm, M.: Inconsistency measurement in
probabilistic logic. In: Measuring Inconsistency in Information, College Publica-
tions (2018)
13. De Bona, G., Grant, J., Hunter, A., Konieczny, S.: Towards a unified framework
for syntactic inconsistency measures. In: Proceedings of AAAI 2018 (2018)
14. Decker, H., Misra, S.: Database inconsistency measures and their applications. In:
Damaševičius, R., Mikašytė, V. (eds.) ICIST 2017. CCIS, vol. 756, pp. 254–265.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67642-5_21
15. Gelfond, M., Leone, N.: Logic programming and knowledge representation - the
a-prolog perspective. Artif. Intell. 138(1–2), 3–38 (2002)
16. Grant, J., Hunter, A.: Analysing inconsistent first-order knowledgebases. Artif.
Intell. 172(8–9), 1064–1093 (2008)
17. Grant, J., Hunter, A.: Measuring consistency gain and information loss in stepwise
inconsistency resolution. In: Liu, W. (ed.) ECSQARU 2011. LNCS (LNAI), vol.
6717, pp. 362–373. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22152-1_31
18. Grant, J., Hunter, A.: Analysing inconsistent information using distance-based
measures. Int. J. Approximate Reasoning 89, 3–26 (2017)
19. Grant, J., Martinez, M.V. (eds.): Measuring Inconsistency in Information. College
Publications, London (2018)
20. Grant, J.: Classifications for inconsistent theories. Notre Dame J. Form. Log. 19(3),
435–444 (1978)
21. Hansson, S.O.: A Textbook of Belief Dynamics. Kluwer Academic Publishers,
Dordrecht (2001)
22. Hunter, A., Konieczny, S.: Approaches to measuring inconsistent information. In:
Bertossi, L., Hunter, A., Schaub, T. (eds.) Inconsistency Tolerance. LNCS, vol.
3300, pp. 191–236. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-30597-2_7
23. Hunter, A., Konieczny, S.: Shapley inconsistency values. In: Proceedings of KR
2006, pp. 249–259 (2006)
24. Hunter, A., Konieczny, S.: Measuring inconsistency through minimal inconsistent
sets. In: Proceedings of KR 2008, pp. 358–366 (2008)
25. Jabbour, S., Ma, Y., Raddaoui, B.: Inconsistency measurement thanks to MUS
decomposition. In: Proceedings of AAMAS 2014, pp. 877–884 (2014)
26. Jabbour, S.: On inconsistency measuring and resolving. In: Proceedings of ECAI
2016, pp. 1676–1677 (2016)
27. Knight, K.M.: Measuring inconsistency. J. Philos. Log. 31, 77–98 (2001)
28. Konieczny, S., Pino Pérez, R.: On the logic of merging. In: Proceedings of KR 1998
(1998)
29. Ma, Y., Hitzler, P.: Distance-based measures of inconsistency and incoherency for
description logics. In: Proceedings of DL 2010 (2010)
30. Ma, Y., Qi, G., Xiao, G., Hitzler, P., Lin, Z.: An anytime algorithm for comput-
ing inconsistency measurement. In: Karagiannis, D., Jin, Z. (eds.) KSEM 2009.
LNCS (LNAI), vol. 5914, pp. 29–40. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-10488-6_7
31. Ma, Y., Qi, G., Xiao, G., Hitzler, P., Lin, Z.: Computational complexity and any-
time algorithm for inconsistency measurement. Int. J. Softw. Inform. 4(1), 3–21
(2010)
32. McAreavey, K., Liu, W., Miller, P.: Computational approaches to finding and mea-
suring inconsistency in arbitrary knowledge bases. Int. J. Approximate Reasoning
55, 1659–1693 (2014)
33. Potyka, N., Thimm, M.: Inconsistency-tolerant reasoning over linear probabilistic
knowledge bases. Int. J. Approximate Reasoning 88, 209–236 (2017)
34. Reiter, R.: A logic for default reasoning. Artif. Intell. 13, 81–132 (1980)
35. Thimm, M., Wallner, J. P.: Some complexity results on inconsistency measurement.
In: Proceedings of KR 2016, pp. 114–123 (2016)
36. Thimm, M.: On the expressivity of inconsistency measures. Artif. Intell. 234, 120–
151 (2016)
37. Thimm, M.: Stream-based inconsistency measurement. Int. J. Approximate Rea-
soning 68, 68–87 (2016)
38. Thimm, M.: Measuring inconsistency with many-valued logics. Int. J. Approximate
Reasoning 86, 1–23 (2017)
39. Thimm, M.: On the compliance of rationality postulates for inconsistency mea-
sures: a more or less complete picture. Künstliche Intell. 31(1), 31–39 (2017)
40. Thimm, M.: The tweety library collection for logical aspects of artificial intelligence
and knowledge representation. Künstliche Intell. 31(1), 93–97 (2017)
41. Thimm, M.: On the evaluation of inconsistency measures. In: Measuring Inconsis-
tency in Information. College Publications (2018)
42. Ulbricht, M., Thimm, M., Brewka, G.: Inconsistency measures for disjunctive logic
programs under answer set semantics. In: Measuring Inconsistency in Information.
College Publications (2018)
43. Ulbricht, M., Thimm, M., Brewka, G.: Measuring strong inconsistency. In: Pro-
ceedings of AAAI 2018, pp. 1989–1996 (2018)
44. Xiao, G., Ma, Y.: Inconsistency measurement based on variables in minimal unsat-
isfiable subsets. In: Proceedings of ECAI 2012 (2012)
45. Zhou, L., Huang, H., Qi, G., Ma, Y., Huang, Z., Qu, Y.: Measuring inconsistency
in DL-lite ontologies. In: Proceedings of WI-IAT 2009, pp. 349–356 (2009)
Using Graph Convolutional Networks for Approximate Reasoning with Abstract Argumentation Frameworks: A Feasibility Study

Isabelle Kuhlmann and Matthias Thimm

University of Koblenz-Landau, Koblenz, Germany
[email protected]
Abstract. We employ graph convolutional networks for the purpose of determining the set of acceptable arguments under preferred semantics in abstract argumentation problems. While the latter problem is complexity-wise one of the hardest problems in reasoning with abstract argumentation problems, approximate methods are needed here in order to obtain a practically relevant runtime performance. This first study shows that deep neural network models such as graph convolutional networks significantly improve the runtime while keeping the accuracy of reasoning at about 80% or even more.

Keywords: Neural network · Reasoning · Abstract argumentation

1 Introduction

Computational models of argumentation [3] are approaches for non-monotonic reasoning that focus on the interplay between arguments and counterarguments in order to reach conclusions. These approaches can be divided into either abstract or structured approaches. The former encompass the classical abstract argumentation frameworks following Dung [9] that model argumentation scenarios by directed graphs, where vertices represent arguments and directed links represent attacks between arguments. In these graphs one is usually interested in identifying extensions, i.e., sets of arguments that are mutually acceptable and thus provide a coherent perspective on an outcome of the argumentation. Structured argumentation approaches, on the other hand, consider arguments to be collections of formulas and/or rules which entail some conclusion. The most prominent structured approaches are ASPIC+ [21], ABA [26], DeLP [13], and deductive argumentation [4]. These approaches consider a knowledge base of formulas and/or rules as a starting point.
In this paper, we are interested in approximate methods to reasoning with
abstract argumentation approaches. Previous works on reasoning with abstract
argumentation focus mostly on sound and complete methods, see e.g. [5] for
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 24–37, 2019.
https://doi.org/10.1007/978-3-030-35514-2_3
a recent survey and the International Competition on Computational Models of Argumentation1 (ICCMA) [12,25] for actual implementations. To the best
of our knowledge, the only incomplete algorithms for abstract argumentation
are [22,24] that use stochastic local search. Here, we use deep neural networks
to model the problem of deciding (credulous) acceptability of arguments wrt.
preferred semantics as a classification problem. We train a graph convolutional
neural network [17]—a special form of a convolutional neural network that is
tailored towards processing of graphs—with data obtained by random generation
of abstract argumentation frameworks and annotated by a sound and complete
solver (in our case CoQuiAAS [19]). After training, the obtained classifier can be
used to solve the acceptability problem in constant time. However, the obtained
classifier provides only an approximation to the actual answer. Our experiments
showed that approximation quality is about 80% in general, while it can be up to
99% in certain cases.
The remainder of this paper is structured as follows. In Sect. 2, the basic
concepts of abstract argumentation and artificial neural networks are recalled.
Section 3 explains the approach of representing the acceptability problem as a classification problem. Section 4 describes our experimental evaluation and
discusses its results. We conclude in Sect. 5 with a discussion and summary.

2 Preliminaries
In the following, we recall basic definitions of abstract argumentation and artificial neural networks.

2.1 Abstract Argumentation


An abstract argumentation framework [9] AF is a tuple AF = (Arg, →) where
Arg is a set of arguments and → ⊆ Arg × Arg is the attack relation.
Semantics are given to abstract argumentation frameworks by means of
extensions. A set of arguments E ⊆ Arg is called an extension if it fulfils certain conditions. There are various types of extensions; however, this paper will be focused on the four classical types proposed by Dung [9], namely complete, grounded, preferred, and stable semantics. All of these types of extensions must be conflict-free. A set of arguments E ⊆ Arg in an argumentation framework AF = (Arg, →) is conflict-free iff there are no arguments A, B ∈ E with A → B.
Moreover, an argument A is called acceptable with respect to a set of arguments E ⊆ Arg iff for every B ∈ Arg with B → A there is an argument A′ ∈ E with A′ → B. Based on these definitions, the four different types of extensions are defined for an argumentation framework AF = (Arg, →) as follows:
1. Complete extension: A set of arguments E ⊆ Arg is called a complete
extension iff it is conflict-free, all arguments A ∈ E are acceptable with
1 http://argumentationcompetition.org.
Fig. 1. Artificial neuron, adapted from https://inspirehep.net/record/1300728/plots

respect to E and there is no argument B ∈ Arg \ E that is acceptable with respect to E.
2. Grounded extension: A set of arguments E ⊆ Arg is called a grounded
extension iff it is complete and E is minimal with respect to set inclusion.
3. Preferred extension: A set of arguments E ⊆ Arg is called a preferred
extension iff it is complete and E is maximal with respect to set inclusion.
4. Stable extension: A set of arguments E ⊆ Arg is called a stable extension
iff it is complete and ∀A ∈ Arg\E : ∃B ∈ E with B → A.
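These definitions can be checked directly on small frameworks by exhaustive enumeration. The following sketch is our own illustration, not code from the paper; it is naive and exponential in the number of arguments, so it is only usable for tiny frameworks, but it mirrors the four definitions above one-to-one:

```python
from itertools import combinations

def extensions(args, attacks):
    """Enumerate complete, grounded, preferred and stable extensions of
    the AF (args, attacks) by brute force (exponential in |args|)."""
    atk = set(attacks)

    def conflict_free(E):
        return not any((a, b) in atk for a in E for b in E)

    def acceptable(a, E):
        # every attacker of a is counter-attacked by some member of E
        return all(any((d, b) in atk for d in E)
                   for b in args if (b, a) in atk)

    subsets = [set(c) for r in range(len(args) + 1)
               for c in combinations(args, r)]
    complete = [E for E in subsets
                if conflict_free(E)
                and all(acceptable(a, E) for a in E)
                and not any(acceptable(b, E) for b in set(args) - E)]
    grounded = [E for E in complete if not any(F < E for F in complete)]
    preferred = [E for E in complete if not any(E < F for F in complete)]
    stable = [E for E in complete
              if all(any((b, a) in atk for b in E) for a in set(args) - E)]
    return complete, grounded, preferred, stable

# A and B attack each other, B attacks C:
# the preferred (and stable) extensions are {A, C} and {B}
co, gr, pr, st = extensions(['A', 'B', 'C'],
                            [('A', 'B'), ('B', 'A'), ('B', 'C')])
```

Note how the grounded extension of this example is the empty set: no argument is unattacked, so nothing can be accepted unconditionally.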

2.2 Artificial Neural Networks and Graph Convolutional Networks

An artificial neural network (henceforth also referred to as neural network or simply network) generally consists of multiple artificial neurons that are connected with each other. In biology, a neuron is a nerve cell that occurs, for example, in the brain or in the spinal cord. Neurons are specialised in conducting and transferring stimuli [23]. In computer science, (artificial) neurons denote a data structure that was developed to work similarly to its biological example. It is to be noted that there exist different models of artificial neurons and neural networks. Due to its contextual relevance in this paper, solely the structure and functionality of the multilayer perceptron model [14] will be described.
An artificial neuron can have multiple inputs x_i ∈ R with i ∈ {1, ..., n} that form the input vector x = (x_1, ..., x_n)^⊤. Each of the n inputs is multiplied by a weight w_i. In addition to the regular inputs, there are so-called bias inputs b. They serve the purpose of stabilising the computation. As visualised in Fig. 1, an activation function f(·) is applied to the sum of all weighted inputs. The result of the function is the neuron's output [8,16].
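As a minimal illustration of this computation (our own sketch, not from the paper), a single neuron outputs f(Σ_i w_i x_i + b); here a logistic sigmoid is assumed as the activation function f:

```python
import math

def neuron(x, w, b, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """A single artificial neuron: the weighted sum of the inputs plus
    the bias is passed through an activation function f (here: sigmoid)."""
    return f(sum(xi * wi for xi, wi in zip(x, w)) + b)

# f(1.0 * 0.5 + 2.0 * (-0.25) + 0.0) = f(0) = 0.5
out = neuron([1.0, 2.0], [0.5, -0.25], b=0.0)
```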
Analogously to the biological prototype, artificial neurons are connected to
networks. Such networks are usually arranged in layers that consist of at least
one neuron. There is one input layer, one or more so-called hidden layers, and one output layer. It is to be noted that the input layer is considered a layer only for convenience, because it only passes the input values to the next layer without further processing [16,20]. Neural networks can be understood as graphs, with neurons as nodes and their connections as edges. For training neural networks, the back-propagation algorithm is used in most cases. Back-propagation is a supervised learning method, meaning that at all times during training, the output corresponding to the current input must be known. The goal is to find the most exact mapping of the input vectors to their output vectors. This is realised by adjusting the weights on the edges of the graph; see [16] for details.
In the context of graph theory, Kipf et al. [17] introduce graph convolutional
networks that are able to directly use graphs as input instead of a vector of reals.
More precisely, they introduce a layer-wise propagation rule for neural networks
that operates directly on graphs. It is formulated as follows:
H^(l+1) = σ(D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l))   (1)

H^(l) ∈ R^(N×D) denotes the matrix of activations in the l-th layer. σ(·) is an activation function, such as ReLU (Rectified Linear Units) [18]. Moreover, D̃_ii = Σ_j Ã_ij and Ã = A + I_N, where A is the adjacency matrix of the graph and I_N is the identity matrix. W^(l) denotes a layer-specific trainable weight matrix.
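The propagation rule of Eq. (1) can be sketched in a few lines of NumPy (a toy re-implementation for illustration, not the authors' code; ReLU is assumed as σ):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolutional layer as in Eq. (1):
    H' = ReLU(D~^(-1/2) A~ D~^(-1/2) H W), with A~ = A + I (self-loops)
    and D~ the diagonal degree matrix of A~."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                 # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W, 0.0)

# toy graph with 3 nodes, 2 input features, 4 output features
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.random.randn(3, 2)
W = np.random.randn(2, 4)
out = gcn_layer(A, H, W)   # shape (3, 4), all entries non-negative
```

Adding the self-loops via Ã = A + I ensures that a node's own features enter its updated representation, and the symmetric normalisation keeps the spectrum of the propagation matrix bounded.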
Spectral convolutions on graphs are defined as

g_θ ∗ x = U g_θ U^⊤ x.   (2)

A signal x ∈ R^N (a scalar for every node) is multiplied by a filter g_θ = diag(θ), which is parameterised by θ in the Fourier domain. U is the matrix of eigenvectors of the normalised graph Laplacian L = I_N − D^(−1/2) A D^(−1/2) = U Λ U^⊤, where Λ is a diagonal matrix of the Laplacian's eigenvalues. U^⊤ x is the graph Fourier transform of x [17].
For a number of reasons, evaluating Eq. (2) is computationally expensive. For example, computing the eigendecomposition of L might become rather expensive for large graphs. Hammond et al. [15] suggest that g_θ(Λ) can be approximated by a truncated expansion in terms of Chebyshev polynomials in order to avoid this problem:

g_θ(Λ) ≈ Σ_{k=0}^{K} θ_k T_k(Λ̃)   (3)

T_k(x) denotes the Chebyshev polynomial of order k, used here up to order K. The matrix Λ is rescaled to Λ̃ = (2/λ_max) Λ − I_N, where λ_max denotes the largest eigenvalue of L. Besides, θ ∈ R^K is now a vector of Chebyshev coefficients. Integrating this approximation into the definition of a convolution of a signal x with a filter g_θ′ yields

g_θ′ ∗ x ≈ Σ_{k=0}^{K} θ′_k T_k(L̃) x,   (4)

with L̃ = (2/λ_max) L − I_N [17]. Because this convolution is a K-th-order polynomial in the Laplacian, it is K-localised: it depends only on a certain neighbourhood, more specifically, on nodes which are at most K steps away from the central node.
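The filtered signal of Eq. (4) can be evaluated without any eigendecomposition via the Chebyshev recurrence T_k(L̃)x = 2 L̃ T_{k−1}(L̃)x − T_{k−2}(L̃)x, so only K matrix-vector products with L̃ are needed. A sketch (our own, with a hypothetical helper name; dense matrices for brevity):

```python
import numpy as np

def chebyshev_filter(L_tilde, x, theta):
    """Evaluate g * x ~= sum_k theta[k] T_k(L~) x using the recurrence
    T_0(L~)x = x, T_1(L~)x = L~ x, T_k(L~)x = 2 L~ T_{k-1}(L~)x - T_{k-2}(L~)x."""
    T_prev, T_curr = x, L_tilde @ x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_curr
    return out
```

With sparse matrices, each step costs one sparse matrix-vector product, which is what makes the approximation attractive for large graphs.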
Stacking multiple convolutional layers in the form of Eq. (4) (each layer followed by a point-wise non-linearity) leads to a neural network model that can directly process graphs.

3 Casting the Acceptability Problem as a Classification Problem

In abstract argumentation there are several interesting decision problems with varying complexity [10]. For example, the problem Cred_σ, with σ being either complete, grounded, preferred, or stable semantics, asks for a given AF = (Arg, →) and an argument A ∈ Arg whether A is contained in at least one σ-extension of AF. For preferred semantics this is an NP-complete problem [10]. For our first feasibility study here, we will focus on this problem, i.e., Cred_PR.

In order to represent Cred_PR as a classification problem, we assume that for any given input argumentation framework AF = (Arg, →) we have an arbitrary but fixed order of the arguments, i.e., Arg = {A_1, ..., A_n}. Moreover, let A denote the set of all abstract argumentation frameworks and V the set of all vectors with values in [0,1] of arbitrary dimension. Conceptually, our classifier C will then be a function of the type C : A → V with |C(Arg, →)| = |Arg|, i.e., on an input argumentation framework with n arguments we obtain an n-dimensional real vector as the result.2 The interpretation of this output is that the i-th entry of C(Arg, →) denotes the likelihood of argument A_i being credulously accepted wrt. preferred semantics. Of course, a sound and complete classifier C should output 1 whenever this is true and 0 otherwise. However, as we only approximate the true solution, all values in the interval [0,1] are possible.

The function C, in our case represented by a graph convolutional network, will be trained on benchmark graphs where the gold standard, i.e. the true solutions, is available, e.g., by means of asking a complete oracle solver. Given enough and diverse benchmark graphs for training, our main hypothesis is that C approximates the intended behaviour.
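In the paper, the gold-standard labels come from a complete solver (CoQuiAAS). As an illustrative stand-in for tiny frameworks (our own sketch, hypothetical helper name), one can use the standard result that an argument lies in some preferred extension iff it lies in some admissible set, since every admissible set is contained in a preferred extension:

```python
from itertools import combinations

def cred_pr_labels(args, attacks):
    """Gold-standard label vector for Cred_PR by brute force:
    label[i] = 1 iff args[i] lies in at least one admissible set
    (equivalently, in at least one preferred extension).
    Exponential in |args|, for illustration only."""
    atk = set(attacks)

    def admissible(E):
        # conflict-free and every member of E is defended by E
        if any((a, b) in atk for a in E for b in E):
            return False
        return all(any((d, b) in atk for d in E)
                   for a in E for b in args if (b, a) in atk)

    accepted = set()
    for r in range(len(args) + 1):
        for E in map(set, combinations(args, r)):
            if admissible(E):
                accepted |= E
    return [1 if a in accepted else 0 for a in args]

# A attacks B, B attacks C: A is unattacked, B is never defended,
# C is defended by A, so the labels are [1, 0, 1]
labels = cred_pr_labels(['A', 'B', 'C'], [('A', 'B'), ('B', 'C')])
```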

4 Experimental Evaluation

The framework for graph convolutional networks (GCNs) offered by Kipf et al. [17], which is realised with the aid of Google's TensorFlow [1], is designed to find labels for certain nodes of a given graph and is thus a reasonable starting point for examining whether it is possible to decide, by the use of neural networks, whether an argument is credulously accepted wrt. preferred semantics.

2 Note that implementation-wise this is not completely true, as the size of the output vector has to be fixed.
4.1 Datasets

An essential part of any machine learning task is collecting sufficient training and test data. The probo3 [7] benchmark suite can be used to generate graphs
with different properties. A solver such as CoQuiAAS [19] can then be used
to compute the corresponding extensions. The suite offers three different graph
generators that each yield graphs with different properties. The first one, the
GroundedGenerator, produces graphs that have a large grounded extension. The
SccGenerator produces graphs that are likely to have many strongly connected
components. Lastly, the StableGenerator generates graphs that are likely to have
many stable, preferred, and complete extensions. To provide even more diversity
in the data, we use AFBenchGen 4 [6] as a second graph generator. It generates
random scale-free graphs by using the Barabási-Albert model [2], as well as graphs
using the Watts-Strogatz model [27], and the Erdős-Rényi model [11].
In order to examine the impact of the training set size on the classification
results, a number of different-sized datasets is generated. It is to be noted that
each dataset contains the next smaller dataset in addition to some new data. This
strategy is supposed to keep changes in the character of the dataset minimal.
The test set is, of course, an exception from this rule. Moreover, each dataset
(including the test set) is composed of equal shares of all six previously described
types of graphs, and all graphs have between 100 and 400 nodes. Table 1 gives
an overview.
In addition to the specifically generated test set, a fraction of the benchmark dataset used in the International Competition on Computational Models of Argumentation (ICCMA) 2017 [12] is used in order to examine how a trained model performs on external data. Said fraction consists of 45 graphs of group B (the only one designated for solvers of Cred_PR) that were chosen from all five difficulty categories.

Table 1. Dataset overview.

ID            Number of graphs   Total number of nodes
5-of-each     30                 5,461
10-of-each    60                 12,056
25-of-each    150                32,026
50-of-each    300                73,717
75-of-each    450                108,050
100-of-each   600                149,130
Test          120                30,603

3 https://sourceforge.net/projects/probo/.
4 https://sourceforge.net/p/afbenchgen/wiki/Home/.
4.2 Experimental Setup

The GCN framework [17] was designed to perform node-wise classification on a single large graph in a semi-supervised fashion. In order to use the GCN framework in its intended way, three different matrices need to be provided: an
framework in its intended way, three different matrices need to be provided: an
N × N adjacency matrix (N : number of nodes), an N × D feature matrix (D:
number of features per node), and an N × F binary label matrix (F : number of
classes).
For this work, the training process should be supervised rather than semi-supervised. However, the set of unlabeled nodes can be left empty. Because all
nodes consequently have a known label, the training process becomes supervised
instead of semi-supervised. Besides, instead of one single graph with some nodes
to be classified, entire sets of graphs are supposed to provide the training and
test sets. To realise this, the graphs in both training and test set are considered
one big graph. This yields an adjacency matrix that essentially contains the
adjacency matrices of all graphs. The graphs belonging to the test set make up
the set of nodes that are to be classified.
The feature matrix can be used to provide additional information on the content of the nodes that could be used to improve classification. However, defining an appropriate feature matrix is a rather difficult matter in our application scenario, because the nodes do not contain any information, in contrast to, for example, social networks or citation networks. In Sect. 4.3, two different solutions are explored. The first one is a simple N × 1 matrix that contains the same constant for every node (which means that no additional features are provided for the nodes). For the second option, the numbers of incoming and outgoing attacks per argument are used as features, resulting in an N × 2 matrix (one column each for incoming and outgoing attacks).
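The construction of the three input matrices can be sketched as follows (a toy NumPy illustration of the setup described above, not the authors' pipeline): all adjacency matrices are placed on the diagonal of one combined matrix, and the fm2 features collect in- and out-degrees per argument:

```python
import numpy as np

def build_inputs(graphs):
    """Combine several AFs into one block-diagonal graph.
    graphs: list of (adjacency matrix, 0/1 label vector) pairs, where
    A[i, j] = 1 means argument i attacks argument j.
    Returns the N x N adjacency matrix, the N x 2 feature matrix fm2
    (incoming and outgoing attacks per argument) and the N x 2 one-hot
    label matrix for the classes Yes/No."""
    N = sum(A.shape[0] for A, _ in graphs)
    adj = np.zeros((N, N))
    offset = 0
    for A, _ in graphs:
        n = A.shape[0]
        adj[offset:offset + n, offset:offset + n] = A
        offset += n
    features = np.stack([adj.sum(axis=0),   # incoming attacks (column sums)
                         adj.sum(axis=1)],  # outgoing attacks (row sums)
                        axis=1)
    y = np.concatenate([y for _, y in graphs])
    labels = np.stack([y == 1, y == 0], axis=1).astype(float)
    return adj, features, labels

g1 = (np.array([[0., 1.], [0., 0.]]), np.array([1, 0]))
g2 = (np.array([[0.]]), np.array([1]))
adj, feats, labs = build_inputs([g1, g2])
```

Because the combined matrix is block-diagonal, message passing in the GCN never crosses graph boundaries, so training on the "one big graph" is equivalent to training on the individual graphs.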

4.3 Results

When dealing with artificial neural networks, quite a few parameters can influence the outcome of the training process. The following section describes various
experimental results in which the impact of different factors on the quality of
the classification process is examined. Those factors include, for instance, the
size and nature of the training set, the learning rate, and the number of epochs
being used to train the neural network model. Finally, we report on some runtime
comparison with a sound and complete solver.

Feature Matrix. As explained in Sect. 4.2, there are two different types of
feature matrix that may be used in the training process. While training with
the feature matrix that does not contain any features (henceforth referred to
as fm1) always results in an accuracy of 77.0%, training with the matrix that
encodes incoming and outgoing attacks as features (henceforth referred to as
fm2) offers slightly better results (up to 80.3%). Accuracy is measured by dividing the number of correct predictions by the total number of predictions. The
Table 2. Accuracy per class for both feature matrix types.

              fm1                          fm2
Accuracy Yes  Accuracy No    Accuracy Yes  Accuracy No
0.0000        1.0000         0.1499        0.9846
0.0000        1.0000         0.2025        0.9810
0.0000        1.0000         0.2083        0.9803

Table 3. Training results for individual graph types and parameter settings for training. Additional parameters were set as follows: number of epochs: 500, learning rate: 0.001, dropout: 0.05.

               Barabási-Albert  Erdős-Rényi  Grounded  Scc     Stable  Watts-Strogatz
Accuracy Yes   1.0000           0.0000       0.0771    0.0000  0.0000  0.0000
Accuracy No    0.0000           1.0000       0.9950    1.0000  1.0000  1.0000
Accuracy total 0.8421           0.8152       0.7109    0.9886  0.8421  0.9988
F1 Score       0.0000           0.0000       0.1417    0.0000  0.0000  0.0000

accuracy value for class Yes can also be viewed as the recall value, which is calculated by dividing the number of true positives by the sum of true positives and
false negatives. Moreover, by calculating the precision (true positives divided by
the sum of true positives and false positives), the F1 score can be obtained as
follows:
F1 = 2 · (Precision · Recall) / (Precision + Recall)   (5)
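For completeness, all reported scores can be computed from a prediction vector as follows (a generic sketch, not the evaluation code of the paper; class Yes is encoded as 1, class No as 0):

```python
def scores(y_true, y_pred):
    """Total accuracy, per-class accuracies (the Yes-accuracy equals
    recall), precision and F1 score for binary Yes/No predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0      # accuracy for class Yes
    acc_no = tn / (tn + fp) if tn + fp else 0.0      # accuracy for class No
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, recall, acc_no, precision, f1
```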
Moreover, because it seems unusual that multiple different training setups
all return the same value, it is important to also look into the class-specific
accuracies. Table 2 reveals that the network only learned to classify all nodes as
No when trained with fm1. Incorporating fm2 into the training process leads
to an accuracy of class Yes of up to 20.8%. Whereas this result still needs
optimisation, it shows that using fm2 is the more promising approach. In all
following experiments, fm2 is used.

Graph Types. In order to further investigate the background of the prior results, the different graph types are examined. Six additional datasets that
consist of one graph type each, are created. Each one contains 100 graphs for
training and 20 graphs for testing. Essentially, the 100-of-each training set and
the test set are split into six subsets consisting of only one graph type per set.
In Table 3, the training results, alongside the settings that were used to
retrieve these values, are presented. Several observations can be made from the
results. Firstly, a set of parameter settings does not work equally well on all graph
types. While four out of six graph types only learn to decide on one class for
all instances, Grounded and Stable graphs show first signs of a deeper learning
Table 4. Classification results after training with different-sized training sets. Parameter settings: epochs: 500, learning rate: 0.01, dropout rate: 0.05. However, a difference in training set size might require different settings. For example, a larger dataset might need more epochs to converge than a smaller one.

Dataset      Accuracy Yes  Accuracy No  Accuracy total  F1 score
5-of-each    0.0000        1.0000       0.7701          0.0000
10-of-each   0.1869        0.9795       0.7972          0.2976
25-of-each   0.2025        0.9810       0.8020          0.3199
50-of-each   0.2170        0.9797       0.8043          0.3377
75-of-each   0.2174        0.9793       0.8041          0.3380
100-of-each  0.2210        0.9786       0.8044          0.3419

process. Increasing the number of epochs to 1000 yields exactly the same accuracies for Barabási-Albert, Erdős-Rényi, Scc, and Watts-Strogatz graphs, but
improves the values for Grounded and Stable. This leads to the assumption that
the graph types are of different difficulty for the network to learn. The fact that
98.86% (Scc) or even 99.89% (Watts-Strogatz) of the graphs’ nodes belong to
one class supports this assumption. Classifying such unevenly distributed classes
is quite a difficult task for a neural network.
Another observation is that the set of Barabási-Albert graphs is the only one
where the majority of instances is in the class Yes. This might help creating
a dataset with more evenly distributed classes. Generally, it is certainly helpful
to have some graphs with more Yes instances in a dataset in order to generate
more diversity. Having a diverse dataset is a vital aspect when training neural
networks. Otherwise, the network might overfit to irrelevant features or might
not work for some application scenarios.

Dataset Size. Besides the influence of a dataset’s diversity, the amount of data
also has an impact on the training process. Table 4 shows some classification
results for the different datasets described in Sect. 4.1. As expected, it indicates
that bigger training sets have a greater potential to improve classification results.
Nonetheless, utilizing more training data does not automatically mean better
results. As displayed in Table 4, adding more than 50 graphs of each type does
not yield a significant increase in accuracy. The values for overall accuracy and
accuracy for class No do not change much at all (both less than 3.5%) when
adding more training data. It is, however, crucial to look into the accuracy of
class Yes as well as the F1 scores, because it indicates that the network actually
learned some features of a preferred extension, instead of guessing No for all
instances. Training with 25 graphs per type (150 in total) already results in
20.25% accuracy of class Yes—only 1.85% less than a training with a total of 600
graphs yields. Training with 50 graphs per type increases the accuracy for Yes
by another 1.45%, which may still be regarded as significant when considering
that the difference to the next bigger training set is merely 0.04%. In summary,
Table 5. Classification results after training with a more balanced dataset in regard
to instances per class.

Epochs  Learning rate  Dropout  Accuracy Yes  Accuracy No  Accuracy total  F1 score
500     0.1            0.05     0.2488        0.9705       0.8045          0.3693
500     0.01           0.05     0.2589        0.9669       0.8041          0.3781
500     0.001          0.05     0.2372        0.9735       0.8042          0.3578
250     0.01           0.05     0.2659        0.9644       0.8037          0.3839
750     0.01           0.05     0.2728        0.9622       0.8037          0.3899
500     0.01           0.01     0.2682        0.9637       0.8038          0.3859
500     0.01           0.1      0.2494        0.9697       0.8041          0.3693

the increase in accuracy for class Yes rather quickly starts stagnating when
more data is added.

Optimisation. Training a neural network is a task that demands careful adjustment of various parameters and other aspects. This section describes several approaches that may optimise the results gathered so far.
The main problem with the previous results is that the model seems to underfit. A reason for that might be that the training set is badly balanced in terms
of number of instances per class. A dataset where the two classes are about
equally distributed might lead to an improvement. Therefore, an additional
training set is generated, which consists of 100 Barabási-Albert graphs and a
total of 100 graphs of the other types (20 graphs of each). The results for training with this dataset under different parameter settings (regarding the learning
rate, number of epochs, and dropout rate) are displayed in Table 5. It becomes
clear that the overall accuracy does not improve significantly in comparison to
the previous results. Nevertheless, the accuracy of class Yes increased to values
between 23.72% (500 epochs, learning rate 0.001, dropout 0.05) and 27.28% (750
epochs, learning rate 0.01, dropout 0.05). So, these results might be considered
a slight improvement, because they are more evenly distributed than the former
ones. Another observation is that changes in number of epochs, learning rate, or
dropout rate do not lead to any significant improvements in total accuracy. In
fact, most alterations in parameter settings yield slightly worse results.
Looking into the actual numbers of instances of Yes and No reveals that instances of the latter class still form the majority (54.4%). To further equalize the number of instances per class, the training set is augmented with 27 more Barabási-Albert graphs (7,300 arguments). The distribution of ground-truth labels is now 50.6% Yes and 49.4% No, respectively. Training the neural network with this dataset (parameters are set to 500 epochs, a learning rate of 0.01, and a dropout rate of 0.05) results in a total accuracy of 80.0%. The accuracy for class Yes, however, increased to 29.7%, while the corresponding value for class No marginally decreased to 95.0%. This demonstrates that using a more balanced training set
34 I. Kuhlmann and M. Thimm

Fig. 2. Results for testing with benchmark data. (Bar chart comparing the two training sets, Yes/No balanced and 50-of-each: total accuracy 0.62 vs. 0.63, accuracy for class No 0.97 vs. 0.93, accuracy for class Yes 0.1 vs. 0.17.)

(with respect to instances per class) also leads to more balanced results. Since the test set consists of 77.0% instances of class No, however, the total accuracy does not increase.
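Why the total accuracy barely moves can be seen with a little arithmetic; the following sketch (with the class shares and per-class accuracies taken from the text above) is only an illustration:

```python
# Total accuracy is the class-share-weighted mean of the per-class
# accuracies; with a test set that is 77.0% No, even a clearly better
# class-Yes accuracy barely moves the total. Numbers are from the text.

def total_accuracy(share_no, acc_no, acc_yes):
    return share_no * acc_no + (1.0 - share_no) * acc_yes

print(f"{total_accuracy(0.77, 0.95, 0.297):.3f}")  # 0.800, as reported
```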

Competition Data. In order to get a sense of how the training results transfer to other data, two differently trained models are tested on the competition data (see Sect. 4.1). The first model is trained with the 50-of-each dataset. The learning rate is set to 0.01, dropout to 0.05, and the number of epochs to 500. The second model uses the same settings, but is trained with the more balanced dataset containing 127 Barabási-Albert graphs and 100 others, as illustrated above. Figure 2 displays a comparison of the results. The overall accuracy is very similar for both training sets: about 17% lower than for the regular test set; the class-specific accuracy values are lower, too. This might be due to the benchmark dataset containing graphs that are smaller or larger than the ones in the training set. Also, additional types of graphs are included in the benchmark dataset.

Runtime Performance. Aside from the quality of the classification results, another aspect that needs to be considered is time efficiency. In order to put the GCN's efficiency into perspective, it is compared to CoQuiAAS, the SAT-based argumentation solver used to provide the ground-truth labels for the training and test sets.
For the GCN approach, only the time for evaluating the test set is measured, since a neural network can, once it is trained, classify as many arguments as one wishes. Both methods are evaluated on classifying the entire test set (see Sect. 4.1) using the same hardware. The difference is enormous: while the GCN classifies the entire test set within less than 0.5 s, CoQuiAAS needs about an hour (60.98 min). It is to be noted that the value for testing with a trained GCN varies somewhat depending on the training conditions. For example, a measurement taken after training with the biggest training set (600 graphs) is 0.22 s; training with half the data led to 0.13 s.
Using Graph Convolutional Networks for Abstract Argumentation 35

Table 6 reveals the big fluctuations in the amounts of time CoQuiAAS needs to decide for a single argument whether or not it is included in a preferred extension.

Table 6. Time measurements in comparison.

Method    Property  Time in seconds
CoQuiAAS  Maximum   19.274452
CoQuiAAS  Mean      0.119561
CoQuiAAS  Minimum   0.002222
GCN       Mean      0.000007

While the lowest value lies at 0.002 s, the highest one lies at 19.27 s, which is about 8674 times as much. It is also worth noting that, if evaluating the whole test set takes the GCN 0.22 s, it takes an average of 7 · 10⁻⁶ = 0.000007 s per argument. That means that the minimal amount of time CoQuiAAS needed to evaluate an argument is still 317 times as much as the average amount of time the GCN takes. We only report on the mean runtime for the GCN approach, as classification is independent of the instance: its cost is polynomial only in the size of the trained network. It follows that the GCN approach has constant runtime with respect to the size of the instance.
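The reported ratios can be recomputed directly from the raw measurements; a small sketch, using only the numbers given in Table 6 and the text:

```python
# Recomputing the runtime ratios reported above from the raw measurements
# (Table 6, plus the 0.22 s the GCN takes for the 30,603-argument test set).

coquiaas_max = 19.274452    # s, slowest single argument (Table 6)
coquiaas_min = 0.002222     # s, fastest single argument (Table 6)
gcn_mean     = 0.000007     # s per argument, as rounded in the text

print(round(coquiaas_max / coquiaas_min))   # 8674
print(round(coquiaas_min / gcn_mean))       # 317
print(round(0.22 / 30_603, 6))              # 7e-06 s per argument for the GCN
```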
Of course, one needs to consider that a neural network also needs time for training and possibly for preprocessing. Using the GCN framework, the training process took approximately between 20 min and two hours, depending on the dataset size and parameter settings such as the number of epochs or the learning rate. For other network models and frameworks, training might take a lot longer. Nonetheless, once sufficient data is provided and the network is trained, it can be used for any test set and is extremely fast.

5 Conclusion

All in all, the attempt to train a graph convolutional network on abstract argumentation frameworks in order to decide whether or not an argument is included in a preferred extension was only moderately successful. The overall accuracy never exceeded 80.5%; when testing with benchmark data, it was even lower (63%). However, extending the diversity of the training set, for instance by adding graphs of different sizes or by adding new types of graphs, might improve this result.
Furthermore, training a neural network model involves adjusting a great number of parameters, some of which depend on each other. Considering that training a neural network requires careful adaptation of the training data, the parameter settings, and the network architecture itself, and that some aspects also affect others, examining all reasonable possibilities is beyond the scope of this work.
The training results are moderate: on the one hand, the overall classification accuracy does not exceed 80.5%, which is not good enough for practical applications; on the other hand, it shows that the network learned at least some rudimentary features of a preferred extension. The fact that instances from
both classes can be classified correctly reinforces this statement. The accuracy for class Yes is far lower (<30%) than the accuracy for class No (>90%) in all training procedures. A reason for this effect may be that the majority of the training data is not included in an extension and thus labelled No. Using a training set in which the distribution of instances per class is more balanced counteracts this effect to some degree. Using benchmark data for testing leads to an overall accuracy of about 63%. The decrease in accuracy in comparison to the specifically generated test set might be due to graph sizes and types that are unknown to the network model, as they were not included in the training data. Moreover, a GCN's classification process is very time-efficient: the entire test set (30,603 arguments) is classified in less than 0.5 s. For comparison, the SAT-based solver CoQuiAAS takes about an hour for the same dataset.
Generally, neural networks seem suited to performing the task of classifying arguments as “included in a preferred extension” or “not included in a preferred extension”; after all, it did work to a certain degree. Nevertheless, the chosen network architecture seems to be inadequate for the task of abstract argumentation, and it is quite possible that a different network architecture leads to better results. For example, an increased number of layers in a network or more neurons per layer may increase the network's ability to learn more complex features. The results gathered in this paper show signs of underfitting, so a deeper network would be a plausible strategy. Besides, GCNs were originally constructed to process undirected graphs, yet argumentation frameworks are represented as directed graphs. If a better-suited neural network is found, the next step could be to expand the classification problem to a regression problem by training the network to predict entire extensions, or even all possible extensions of an argumentation framework.
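The remark about undirected graphs can be made concrete with a small sketch of the graph-convolutional propagation rule of Kipf and Welling, H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W); the tiny framework, feature matrix, and weights below are made up for illustration:

```python
import math

# Sketch of the Kipf-Welling graph-convolution propagation rule,
#   H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
# The symmetric normalisation presupposes an undirected (symmetric)
# adjacency matrix, so the directed attack relation of an argumentation
# framework must first be symmetrised, losing the direction of attacks.
# Plain nested lists stand in for matrices; all values are made up.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, H, W):
    m = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(m)] for i in range(m)]
    d = [sum(row) for row in A_hat]                       # degrees incl. self-loop
    S = [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(m)] for i in range(m)]
    return [[max(x, 0.0) for x in row] for row in matmul(matmul(S, H), W)]  # ReLU

# Directed attack relation a1 -> a2 -> a3, then symmetrised:
A_dir = [[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]]
A_sym = [[A_dir[i][j] + A_dir[j][i] for j in range(3)] for i in range(3)]

H = [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]   # one-hot input features
W = [[0.5, 0.5]] * 3                             # toy 3x2 weight matrix
out = gcn_layer(A_sym, H, W)
print(len(out), len(out[0]))                     # 3 2: one 2-dim vector per argument
```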

Acknowledgements. The research reported here was partially supported by the Deutsche Forschungsgemeinschaft (grant KE 1686/3-1).

The Hidden Elegance of Causal
Interaction Models

Silja Renooij1(B) and Linda C. van der Gaag1,2

1 Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
[email protected]
2 Dalle Molle Institute for Artificial Intelligence, Lugano, Switzerland
[email protected]

Abstract. Causal interaction models, such as the noisy-or model, are used in Bayesian networks to simplify probability acquisition for variables with large numbers of modelled causes. These models essentially prescribe how to complete an exponentially large probability table from a linear number of parameters. Yet, typically the full probability tables are required for inference with Bayesian networks in which such interaction models are used, although inference algorithms tailored to specific types of network exist that can directly exploit the decomposition properties of the interaction models. In this paper we revisit these decomposition properties in view of general inference algorithms and demonstrate that they allow an alternative representation of causal interaction models that is quite concise, even with large numbers of causes involved. In addition to forestalling the need for tailored algorithms, our alternative representation brings engineering benefits beyond those widely recognised.

Keywords: Bayesian networks · Causal interaction models · Maintenance robustness

1 Introduction

The use of causal interaction models has become popular as a technique for simplifying probability acquisition upon building Bayesian networks for real-world applications. These interaction models essentially impose specific patterns of interaction among the causal influences on an effect variable, by means of a parameterised conditional probability table for the latter variable. The number of parameters involved in this table typically is linear in the number of causes involved, whereas the full table itself is exponentially large in this number. Various different causal interaction models have been designed for use in Bayesian networks, the best known among which are the (leaky) noisy-or model and its generalisations (see for example [4,11,17]).
While a causal interaction model describes the conditional probability table for the effect variable in a causal mechanism by a linear number of parameters, most software packages for inference with the embedding Bayesian network
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 38–51, 2019.
https://doi.org/10.1007/978-3-030-35514-2_4
The Hidden Elegance of Causal Interaction Models 39

require the fully specified table. This full probability table is then generated from the parameters and the definition of the interaction model used, prior to the inference. Using fully expanded probability tables is associated with two serious disadvantages, however. Firstly, the size of the full table is exponential in the number of cause variables involved in a causal mechanism, which induces both the specification size of the network and the runtime complexity of inference to increase substantially. Secondly, using full tables has the engineering disadvantage that the modelling decision to impose a specific pattern of causal interaction is no longer explicit in the representation, as a consequence of which the intricate dependencies between the cells of the table are effectively hidden.
For richly-connected Bayesian networks with large numbers of cause variables per effect variable, as found for example in probabilistic relational models [7], inference scales poorly and quickly becomes infeasible. Over the last decades, therefore, researchers have addressed ways to ameliorate the representational and inferential complexity of using fully expanded probability tables with causal interaction models. One such approach has focused on the design of tailored inference algorithms for noisy-or Bayesian networks, which trade off general applicability and runtime efficiency; these algorithms in essence exploit the structured specification of the noisy-or model for all variables upon inference (see for example [5,6,8,12,15]). While experimental results underline their scalability for noisy-or networks, these tailored algorithms are not easily integrated with current algorithms for probabilistic inference in general. Another approach to tackling the representational and inferential complexity of using fully expanded probability tables for causal interaction models has focused on the design of more concise representations of causal mechanisms; these alternative representations in essence are distilled automatically from the interaction models at hand and allow the use of general inference algorithms (see for example [9,10,16,18,19]).
In this paper we reconsider and integrate some of the early work in which causal mechanisms with interaction models are represented by alternative graphical structures and probability tables. We demonstrate that interaction models with specific decomposition properties can be represented efficiently by an alternative structure with associated small tables that have an intuitively appealing semantics. This alternative structure can be readily embedded in a general Bayesian network and thereby allows for inference without the necessity of preprocessing tables or using tailored algorithms. We further argue that this alternative representation induces elegant properties from an engineering perspective, which allow more ready maintenance and safer fine-tuning of parameters than the use of fully expanded probability tables in causal mechanisms.
The paper is organised as follows. In Sect. 2, we briefly review causal interaction models, and the (leaky) noisy-or model more specifically. In Sect. 3, we reconsider the partition of causal interaction models into a deterministic function and associated independent noise variables, and demonstrate when and how the underlying deterministic function can be decomposed. Based on these insights, we derive our alternative cascading representation and study its properties in Sect. 4. We conclude the paper in Sect. 5.
40 S. Renooij and L. C. van der Gaag

Fig. 1. A causal mechanism M(n) with n cause variables Ci and the effect variable E
(left); a conditional probability table imposed by the noisy-or model, for n = 3 (right).

2 Preliminaries

We briefly review causal interaction models for Bayesian networks and thereby introduce our notational conventions. In this paper, we focus on binary random variables, which are denoted by (possibly indexed) capital letters X. The values of such a variable X are denoted by small letters; more specifically, we write x̄ and x to denote absence and presence, respectively, of the concept modelled by X. (Sub)sets of variables are denoted by bold-face capital letters X and their joint value combinations by bold-face small letters x; Ω(X) is used to denote the domain of all value combinations of X. We further consider joint probability distributions Pr over sets of variables, represented by a Bayesian network.
Within Bayesian networks, we consider causal¹ mechanisms M(n) composed of a single effect variable E and one or more cause variables Ci, i = 1, . . . , n, with arcs pointing to E; Fig. 1 (left) illustrates the basic idea of such a mechanism. For the effect variable E of a causal mechanism, a conditional probability table is specified, with distributions Pr(E | C) over E for each joint value combination c for its set C of cause variables; this table thus specifies a number of distributions that is exponential in the number of cause variables involved.
A causal interaction model for a causal mechanism M(n) takes the form of a parameterised probability table for the effect variable involved. The noisy-or model [17], which is the best known among these interaction models, defines the conditional probability table for the effect variable E of M(n) through

– the conditional probability Pr(e | c̄1, . . . , c̄n) = 0;
– the parameters pi = Pr(e | c̄1, . . . , c̄i−1, ci, c̄i+1, . . . , c̄n), for all i = 1, . . . , n;
– the definitional rule Pr(e | c) = 1 − ∏_{i∈Ic} (1 − pi) for the probabilities given the remaining value combinations c involving the presence of two or more causes, where Ic is the set of indices of the present causes ci in c.
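The definitional rule above can be sketched in a few lines; the parameter values below are made up for illustration:

```python
# Sketch of the noisy-or rule: Pr(e | c) = 1 - prod_{i in I_c}(1 - p_i),
# computed from the n parameters p_i alone. The parameter values are made up.

def noisy_or(p, c):
    """p[i] = Pr(e | only cause i present); c[i] = True iff cause i is present."""
    prod = 1.0
    for p_i, present in zip(p, c):
        if present:                      # i is in I_c
            prod *= 1.0 - p_i
    return 1.0 - prod

p = [0.8, 0.5, 0.4]                                  # p_1, p_2, p_3
print(round(noisy_or(p, [False, False, False]), 6))  # 0.0, all causes absent
print(round(noisy_or(p, [True, False, False]), 6))   # 0.8, equals p_1
print(round(noisy_or(p, [True, True, False]), 6))    # 0.9 = 1 - 0.2 * 0.5
```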
¹ Although we do not make any claim with respect to causal interpretation, we adopt the terminology commonly used.

Figure 1 (right) illustrates the parameterised table of the noisy-or model for a mechanism with three cause variables. For a causal mechanism M(n), the model

defines a full probability table over n + 1 variables, specifying a total of 2 · 2ⁿ probabilities; half of these are derived from Pr(e | c) + Pr(ē | c) = 1 and, hence, are redundant. Of the 2ⁿ non-redundant probabilities, the noisy-or model allows the values of only the n parameter probabilities pi to be chosen freely. The model further forces the distribution Pr(E | c̄1, . . . , c̄n) to be degenerate.
Since its introduction, the noisy-or model has given rise to several variants and generalisations (see [4] for an overview). Of these, we briefly review here the leaky noisy-or model. This model differs from the noisy-or model in that it includes an additional leak parameter pL = Pr(e | c̄1, . . . , c̄n) that captures the probability of the effect e occurring in the absence of all modelled causes. Different interpretations of the noisy-or parameters in view of this leak probability have given rise to different definitional rules for the remaining probabilities [4,11]. Without loss of generality, we adopt in this paper the interpretation proposed by Díez [4], and use the rule Pr(e | c) = 1 − (1 − pL) · ∏_{i∈Ic} (1 − pi) for the probabilities given arbitrary joint value combinations c with multiple present causes, where Ic again is the set of indices of the causes present in c.
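Díez's rule extends the earlier sketch by a single extra factor; again, all parameter values are made up:

```python
# Sketch of the leaky noisy-or rule just given (Diez's interpretation):
# Pr(e | c) = 1 - (1 - pL) * prod_{i in I_c}(1 - p_i). Values are made up.

def leaky_noisy_or(p, p_leak, c):
    prod = 1.0 - p_leak
    for p_i, present in zip(p, c):
        if present:
            prod *= 1.0 - p_i
    return 1.0 - prod

p = [0.8, 0.5, 0.4]
print(round(leaky_noisy_or(p, 0.1, [False, False, False]), 6))  # 0.1, the leak pL
print(round(leaky_noisy_or(p, 0.0, [True, True, False]), 6))    # 0.9, plain noisy-or
```

With pL = 0 the rule reduces to the plain noisy-or model, which matches the text's observation that the leak only adds one parameter to the specification.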

3 Decomposition of Causal Interaction Models

Causal interaction models are often viewed as combining a deterministic function f with independent noise variables Zi per cause variable (see for example [10,14,17]); Fig. 2 (left) illustrates this view for the (leaky) noisy-or model. The noise variables Zi are associated with the probabilities Pr(zi | ci) = pi and Pr(zi | c̄i) = 0, where the pi are the model's parameters; in the leaky variant of the noisy-or model, the prior probability Pr(zL) = pL for the designated noise variable ZL is the leak parameter. The deterministic function f equals the logical or and is encoded in the probability table Pr(E | Z) for the effect variable E through degenerate distributions. The variable E thereby is a deterministic variable and, by convention, is indicated by a double border in our figure. Slightly abusing notation, we will further write E = f(Z).
The representation in Fig. 2 (left) was introduced originally to indicate how a causal interaction model could ease the task of knowledge acquisition for causal mechanisms involving large numbers of variables [9]: by making independence of the causal influences explicit, the partition into a deterministic part and a probabilistic noise part underlines the requirement of actually just a limited number of parameters. While indeed easing the task of knowledge acquisition for practical applications, the partition of a causal interaction model does not reduce the actual size of its representation for use with general inference algorithms. In fact, embedding the partition of a causal mechanism M(n) in a Bayesian network will increase the total number of variables involved by n and still require the specification of exponentially many probabilities for the effect variable E.
Specific types of causal interaction model, however, actually do allow a reduced representation [10]. More formally, it is specific decomposability properties of the deterministic function f that provide for a reduction of the size of the conditional probability table(s) for the effect variable(s) in a causal mechanism.

Fig. 2. Partition of a causal interaction model into a probabilistic noise part and a deterministic functional part (left); a chain decomposition for a commutative and associative deterministic function (right).

Such decomposability properties of functions are widely used in mathematics and computing science to simplify functions by their hidden structure: a function f(·) on a set of entities is called self-decomposable if, for any two disjoint subsets X, Y, the property f(X ∪ Y) = f(X) ⊗ f(Y) holds, for some commutative and associative merge operator ⊗ (cf. [13]). Commutative and associative logical operators, such as and and or, are self-decomposable Boolean functions. Now, if the deterministic function f modelled for the effect variable E in the partition in Fig. 2 (left) is self-decomposable, it can be split into a sequence of function applications, each to a subset of E's cause variables. Each such application can then be described by an auxiliary effect variable Ei with fewer parents than E.
The set of auxiliary variables resulting from such a functional decomposition can be organised in various different graphical structures. In this paper the chained organisation from Fig. 2 (right) will be used and referred to as a chain decomposition. We would like to note that the idea of introducing additional variables to reduce the number of parents of a variable is a general modelling technique for Bayesian networks, known as parent divorcing [16].
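The self-decomposability of the logical or can be made concrete in a few lines; this is only an illustrative sketch, with made-up inputs:

```python
from functools import reduce

# Self-decomposability sketch: for a commutative and associative merge
# operator, f over a whole set equals a chain of two-argument applications,
# which is what the chain decomposition (and parent divorcing) exploits.
# Here f is the logical OR on Boolean values, with False as its identity.

def f(values):
    return any(values)                   # f applied to the whole set at once

def f_chain(values, identity=False):     # E_n = f(Z_n, I), E_i = f(Z_i, E_{i+1})
    return reduce(lambda e_next, z: f([z, e_next]), reversed(values), identity)

zs = [False, True, False, False]
print(f(zs), f_chain(zs))                # True True: both computations agree
```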
We consider again the partition of a causal interaction model into a probabilistic part with noise variables Zi, i = 1, . . . , n, and a deterministic part E = f(Z1, . . . , Zn) for some self-decomposable deterministic function f. The chain decomposition of the model replaces the effect variable E of this partition by n auxiliary variables Ei, i = 1, . . . , n, such that

– En has the noise variable Zn for its single parent and encodes the function application En = f(Zn, I), where the variable I captures identity under f;
– for all i = 1, . . . , n − 1, the variable Ei has Zi and Ei+1 for its parents and encodes Ei = f(Zi, Ei+1).

If the interaction model includes a leak variable ZL, the identity variable I in the function application f(Zn, I) is replaced by ZL, to give En = f(Zn, ZL).
We note that the number of variables in the chain decomposition has increased, from 2·n + 1 in the original partition, to 3·n. The total number of non-redundant probabilities required for the probability tables for the variables Ei in the chain, however, equals 4·n − 2, instead of the 2ⁿ probabilities required for the effect variable E in the original partition. For an interaction model with a leak variable, the number of required probabilities for the effect variable(s) is reduced from 2ⁿ⁺¹ to 4·n. We will return to these observations in further detail in Sect. 4.
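The counts just given can be tabulated for a few mechanism sizes; a quick sketch using only the formulas from the text:

```python
# Specification sizes compared (counts from the text): the chain
# decomposition needs 4*n - 2 non-redundant probabilities (4*n with a
# leak), against 2**n (2**(n + 1) with a leak) for the fully expanded
# table of the effect variable E.

for n in (3, 5, 10, 20):
    print(n, 4 * n - 2, 2 ** n)   # linear vs. exponential growth
```

For very small n the chain may actually need a few more numbers (10 vs. 8 at n = 3), but from n = 4 onwards the exponential table quickly dwarfs the linear count.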
While the original motivation for partitioning causal interaction models was to underline their induced ease of knowledge acquisition, Heckerman noted that the introduction of the hidden noise variables Zi in fact made probability elicitation harder rather than easier, as “assessments are easier to elicit (and presumably more reliable) when a person makes them in terms of observable variables” [9]. Following this insight, he proposed a temporal interpretation of independence of causal influences for causal interaction models, in which a cause Ci is assumed to occur (or not) at time i and has associated its own effect variable Ei indicating the effect after the presence or absence of the first i causes has been observed. With this temporal interpretation, the hidden noise variables are no longer required and the effect variables Ei have in fact become observable variables with a clear semantics supporting probability elicitation. As noted already by Heckerman himself, this temporal interpretation of causal interaction models has reduced applicability as its main drawback [9,10].

4 Properties of a Cascading Representation

We propose a representation of causal interaction models that is quite similar to Heckerman's temporal representation, yet without the temporal interpretation. We will argue that our representation has a clear semantics and in addition allows for easy maintenance in the event of changes in the parameters of the represented interaction model. Before demonstrating the latter in Sect. 4.2, we first detail our cascading representation of causal interaction models.

4.1 The Cascading Representation and Its Equivalence Property

We focus on causal mechanisms with an underlying self-decomposable deterministic function f as reviewed in the previous section, and consider their chain decomposition as illustrated in Fig. 2 (right). Instead of building on a temporal interpretation as suggested by Heckerman, we propose to sum out the noise
variables Zi by marginalisation. We note that, by doing so, the effect variables Ei, i = 1, . . . , n, become stochastic rather than deterministic. The resulting representation, called the cascading representation of a causal interaction model, is illustrated for a mechanism M(n) in Fig. 3, where

– the variable En, with the cause variable Cn for its single parent, has the probability table derived from the chain decomposition as

  Pr(en | Cn) = Σ_{z′n ∈ Ω(Zn)} Pr(en | z′n) · Pr(z′n | Cn)    (1)

  or, in the presence of a leak probability, as

  Pr(en | Cn) = pL · Σ_{z′n ∈ Ω(Zn)} Pr(en | z′n, zL) · Pr(z′n | Cn)
              + (1 − pL) · Σ_{z′n ∈ Ω(Zn)} Pr(en | z′n, z̄L) · Pr(z′n | Cn)    (2)

– the variables Ei, i = 1, . . . , n − 1, with the parents Ci and Ei+1, have the probability table derived as

  Pr(ei | Ci, Ei+1) = Σ_{z′i ∈ Ω(Zi)} Pr(ei | z′i, Ei+1) · Pr(z′i | Ci)    (3)

We note that all probabilities conditioned on a value of a noise variable originate from the degenerate distributions modelling the deterministic function f of the interaction model. We further note that the inclusion of a leak probability affects only the cells of the probability table for the variable En, whereas it affects, through the definitional rule of the interaction model at hand, all cells in the fully expanded table for the variable E in the causal mechanism.
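The marginalisations of Eqs. 1-3 are mechanical enough to sketch for the (leaky) noisy-or case; the parameter values below are made up, and the closed form for En follows Díez's rule:

```python
# Sketch of Eqs. (1)-(3): the cascading tables arise by summing the noise
# variable out of the deterministic OR tables, with Pr(z_i | c_i) = p_i and
# Pr(z_i | not c_i) = 0. All parameter values below are made up.

def pr_ei(p_i, c_i, e_next):
    """Eq. (3): Pr(e_i | C_i = c_i, E_{i+1} = e_next)."""
    pr_z = p_i if c_i else 0.0                       # Pr(z_i | C_i)
    return sum((1.0 if (z or e_next) else 0.0) *     # degenerate OR table
               (pr_z if z else 1.0 - pr_z)           # noise distribution
               for z in (True, False))

def pr_en(p_n, c_n, p_leak=0.0):
    """Eqs. (1)/(2) in closed form: Pr(e_n | C_n = c_n)."""
    pr_z = p_n if c_n else 0.0
    return 1.0 - (1.0 - p_leak) * (1.0 - pr_z)

print(round(pr_ei(0.4, True, False), 6))        # 0.4: the parameter p_i itself
print(round(pr_ei(0.4, False, True), 6))        # 1.0: effect inherited from E_{i+1}
print(round(pr_en(0.7, False, p_leak=0.1), 6))  # 0.1: the leak only enters E_n's table
```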
To ensure that our cascading representation of an interaction model is equivalent to its original representation in a causal mechanism, the variable E1 in our representation should represent the exact same information as the effect variable E in a mechanism M(n). Any probability Pr(ē | c) = 1 − Pr(e | c) specified in the full probability table for E should therefore be the same as the probability Pr(ē1 | c) that is computed from the cascading representation as

  Pr(ē1 | c) = Σ_{e− ∈ Ω(E−)} Pr(ē1 | c1, e2) · ∏_{k=2}^{n−1} Pr(ek | ck, ek+1) · Pr(en | cn)    (4)

where Ω(E−) is the domain of the variable set E− = {E2, . . . , En}, and where ek ∈ Ω(Ek), k = 2, . . . , n, is consistent with e− and ck ∈ Ω(Ck), k = 1, . . . , n, is consistent with c. We emphasize that we focus on the value ē1 of the variable E1 rather than on the value e1, to simplify our arguments in the sequel.
We now illustrate the derivation of the probability tables for the cascading representations of the noisy-or and leaky noisy-or models, and demonstrate their equivalence to the standard causal-mechanism representation.

The cascading noisy-or. We begin with constructing the conditional probability


tables to be specified for the noisy-or model in its cascading representation. For
the variables Ei , i = 1, . . . , n − 1, we find from Eq. 3 that

    Pr(ei | c̄i, ēi+1) = 1 · 0 + 0 · 1 = 0
    Pr(ei | ci, ēi+1) = 1 · pi + 0 · (1 − pi) = pi
    Pr(ei | c̄i, ei+1) = 1 · 0 + 1 · 1 = 1
    Pr(ei | ci, ei+1) = 1 · pi + 1 · (1 − pi) = 1

where pi = Pr(e | c̄1, . . . , c̄i−1, ci, c̄i+1, . . . , c̄n) coincides with a regular noisy-or


parameter. For the variable En we similarly find from Eq. 2 that

    Pr(en | c̄n) = 1 · 0 + 0 · 1 = 0
    Pr(en | cn) = 1 · pn + 0 · (1 − pn) = pn

where pn is again a regular noisy-or parameter. We observe that each parameter


pi , i = 1, . . . , n, occurs in the specification of exactly one table, which is the table
for the variable Ei . In addition to this single associated noisy-or parameter, the
probability table for the variable Ei further specifies just zeroes and ones.
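The table entries above can be checked mechanically against Eq. 3. The sketch below is an illustration of our own (not code from the paper): it encodes present values as 1 and absent values as 0, sums out the noise variable Zi over the deterministic OR of the interaction model, and recovers the 0 / pi / 1 / 1 pattern.

```python
from itertools import product

def cascade_cpt(p_i):
    """Pr(e_i | C_i, E_{i+1}) via Eq. 3, summing out the noise variable Z_i.

    Encoding: 1 = value present (e, c, z), 0 = value absent (e-bar, c-bar, z-bar).
    """
    # Deterministic OR of the noisy-OR model: e_i present iff z_i or e_{i+1} is.
    def pr_e_given(z, e_next):
        return 1.0 if (z or e_next) else 0.0
    # Noise distribution: Pr(z_i present | c_i present) = p_i; an absent cause
    # never triggers its noise variable.
    def pr_z_given(z, c):
        if c:
            return p_i if z else 1.0 - p_i
        return 0.0 if z else 1.0
    return {(c, e_next): sum(pr_e_given(z, e_next) * pr_z_given(z, c) for z in (0, 1))
            for c, e_next in product((0, 1), repeat=2)}

t = cascade_cpt(0.3)
assert t[(0, 0)] == 0.0               # Pr(e_i | c-bar_i, e-bar_{i+1}) = 0
assert t[(1, 0)] == 0.3               # Pr(e_i | c_i, e-bar_{i+1})     = p_i
assert t[(0, 1)] == 1.0               # Pr(e_i | c-bar_i, e_{i+1})     = 1
assert abs(t[(1, 1)] - 1.0) < 1e-12   # Pr(e_i | c_i, e_{i+1})         = 1
```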
We now show that the cascading representation, with the probability specifi-
cation above, correctly captures the noisy-or model. To this end, we observe that
for a summand of Eq. 4 to actually contribute to the computation of Pr(ē1 | c),
it should be a product composed of just non-zero terms. Such non-zero terms
are found only with the following probabilities:
– Pr(ēn | c̄n) or Pr(ēn | cn), for the variable En;
– Pr(ēi | c̄i, ēi+1), Pr(ēi | ci, ēi+1), and Pr(ei | ci, ēi+1), for the variable Ei,
i = 1, . . . , n − 1;
with ei ∈ Ω(Ei) and ci ∈ Ω(Ci), i = 1, . . . , n. Close examination of these non-
zero probabilities shows that for the value ē1 of E1 under consideration, only
value combinations e− for E− = {E2, . . . , En} consistent with ē2 can possibly
contribute a non-zero term to a summand of Eq. 4. By iteratively applying this
argument to the variables E3, . . . , En, we conclude that only the value com-
bination e− = ē2, . . . , ēn contributes a non-zero summand to the probability
Pr(ē1 | c). For the cascading representation of the noisy-or model therefore,
Eq. 4 reduces to:

    Pr(ē1 | c) = Π_{i=1}^{n−1} Pr(ēi | ci, ēi+1) · Pr(ēn | cn)    (5)

To show that the cascading representation correctly captures the noisy-or


model, we now consider the three different cases distinguished by this model:
– Where the noisy-or model has Pr(e | c) = 0 for c = c̄1, . . . , c̄n, we find in the
cascading representation from Pr(ēj | c̄j, ēj+1) = 1 for j = 1, . . . , n − 1 and
Pr(ēn | c̄n) = 1, that Pr(ē1 | c) = 1 and, hence, Pr(e1 | c) = 0.
46 S. Renooij and L. C. van der Gaag

Table 1. For the two representations of the noisy-or model for a causal mechanism
M(n): the number of variables (#variables), the number of non-redundant probabilities
for the effect variable(s) (#probabilities), and of those, the number of free parameters
to be acquired (#free) and the number of zeroes and ones (#0/1).

Representation  #variables  #probabilities  #free  #0/1
Full table      n + 1       2^n             n      1
Cascade         2·n         4·n − 2         n      3·n − 2

– Where the noisy-or model has Pr(e | c) = pi for c including the single
present cause ci, we have in the cascading representation that the product
term contributed for the variable Ei has the probability Pr(ēi | ci, ēi+1) =
1 − pi or, in case i = n, Pr(ēn | cn) = 1 − pn. As all other terms in the product
of Eq. 5 equal 1, we find that Pr(ē1 | c) = 1 − pi and, hence, Pr(e1 | c) = pi.
– For any value combination c including multiple present causes, with their
indices in Ic, the noisy-or model has Pr(e | c) = 1 − Π_{i∈Ic}(1 − pi). In
the cascading representation, the product term contributed by any Ej with
j ∉ Ic equals 1 and the term by any Ei with i ∈ Ic is 1 − pi. We thus find
that Pr(ē1 | c) = Π_{i∈Ic}(1 − pi) and, hence, Pr(e1 | c) = 1 − Π_{i∈Ic}(1 − pi).

From the three cases above, we conclude that the cascading representation indeed
correctly captures the noisy-or model and, hence, that the cascading representa-
tion is equivalent to the fully expanded probability table for the effect variable
E in a causal mechanism with a noisy-or model.
The cascading representation of the noisy-or model is a more efficient rep-
resentation than a causal mechanism M(n) with a full probability table for the
effect variable E, despite the increase in number of variables to 2·n compared
to the n + 1 variables in the standard representation. More specifically, the cas-
cading representation requires 4·(n − 1) + 2 conditional probability distributions
in total for the variables Ei , of which 3·(n − 1) + 1 are degenerate. For ease of
reference, Table 1 summarises a comparison of the size of the cascading repre-
sentation with that of the standard representation. We note that the cascading
representation is more concise when a causal mechanism would include n ≥ 4
cause variables for the effect variable of interest.
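The equivalence can also be verified by brute force. The sketch below (our own helper names, with a 0/1 encoding of absent/present values) evaluates the sum of Eq. 4 for Pr(e1 | c) and compares it with the closed-form noisy-or probability for every cause combination.

```python
from itertools import product
from math import prod, isclose

def cascade_cpt(p):
    # Pr(e_i present | c_i, e_{i+1}), keyed by (c_i, e_{i+1}); 1 = present.
    return {(0, 0): 0.0, (1, 0): p, (0, 1): 1.0, (1, 1): 1.0}

def noisy_or(ps, c):
    # Full-table noisy-OR: Pr(e | c) = 1 - prod of (1 - p_j) over present causes.
    return 1 - prod(1 - p for p, cj in zip(ps, c) if cj)

def cascade_pr_e1(ps, c):
    # Pr(e_1 | c), summing out E_2, ..., E_n as in Eq. 4.
    n = len(ps)
    cpts = [cascade_cpt(p) for p in ps[:-1]]      # tables for E_1 .. E_{n-1}
    pr_en = ps[-1] if c[-1] else 0.0              # Pr(e_n present | c_n)
    total = 0.0
    for e_rest in product((0, 1), repeat=n - 1):  # values for E_2, ..., E_n
        e = (1,) + e_rest                         # E_1 fixed to present
        term = pr_en if e[-1] else 1 - pr_en
        for i in range(n - 1):                    # chain factors for E_1 .. E_{n-1}
            q = cpts[i][(c[i], e[i + 1])]
            term *= q if e[i] else 1 - q
        total += term
    return total

ps = [0.2, 0.5, 0.7, 0.4]
for c in product((0, 1), repeat=len(ps)):
    assert isclose(cascade_pr_e1(ps, c), noisy_or(ps, c))
```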
The cascading leaky noisy-or. We now briefly address the cascading represen-
tation of the noisy-or model in the presence of a leak probability, which differs
from that of the standard noisy-or model only in the specification of the prob-
ability table for the variable En , which is derived from Eq. 2 as

    Pr(en | c̄n) = pL
    Pr(en | cn) = pL + pn · (1 − pL) = 1 − (1 − pL) · (1 − pn)

where pn is again a regular noisy-or parameter and pL is the leak probabil-


ity. To show that the cascading representation with this specification correctly

captures the leaky noisy-or model, we use Eq. 5 again, now for the different
cases distinguished by the leaky noisy-or model. We observe that, while with
the noisy-or model, the variable En would contribute to the product either
Pr(ēn | c̄n) = 1 or Pr(ēn | cn) = 1 − pn, it contributes either Pr(ēn | c̄n) = 1 − pL
or Pr(ēn | cn) = (1 − pL) · (1 − pn) in the cascading representation of the leaky
noisy-or model. As a consequence:

– With the leaky model having Pr(e | c) = pL for c = c̄1, . . . , c̄n, we find Pr(ē1 |
c) = 1 − pL and, hence, Pr(e1 | c) = pL from the cascading representation.
– For any value combination c with an arbitrary number of present causes with
indices in Ic, the leaky model has Pr(e | c) = 1 − (1 − pL) · Π_{i∈Ic}(1 − pi). Using
the observation above, we find in the cascading representation that Pr(ē1 |
c) = (1 − pL) · Π_{j∈Ic}(1 − pj) and, hence, Pr(e1 | c) = 1 − (1 − pL) · Π_{j∈Ic}(1 − pj).

We conclude that the probabilities computed from the cascading representation


indeed coincide with the probabilities in the full probability table in a causal
mechanism with the leaky noisy-or model. We thus can construct an efficient
representation for a causal mechanism M(n) with the leaky noisy-or model. Of
the 4·(n − 1) + 2 conditional distributions required in total by the cascading
representation, now 3·(n − 1) are degenerate. We note that the difference of one
compared with the cascading representation of the noisy-or model originates
from the inclusion of the leak probability as a parameter.
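Since only the chain ē2, . . . , ēn survives, Eq. 5 reduces the leaky model to a single product in which En carries the leak. A minimal numeric check (helper name and parameter values are illustrative, not from the paper):

```python
from math import prod, isclose

def leaky_cascade_pr_e1(ps, pL, c):
    """Pr(e_1 | c) from the cascade via Eq. 5; c is a 0/1 tuple of cause values."""
    # Factors Pr(e-bar_i | c_i, e-bar_{i+1}) for E_1 .. E_{n-1}:
    # 1 - p_i if the cause is present, 1 otherwise.
    bar = prod(1 - p if ci else 1.0 for p, ci in zip(ps[:-1], c[:-1]))
    # Factor Pr(e-bar_n | c_n) for E_n, with the leak folded in.
    bar *= (1 - pL) * (1 - ps[-1]) if c[-1] else (1 - pL)
    return 1 - bar   # Pr(e_1 | c) = 1 - Pr(e-bar_1 | c)

ps, pL = [0.2, 0.5, 0.7], 0.1
for c in [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]:
    leaky = 1 - (1 - pL) * prod(1 - p for p, ci in zip(ps, c) if ci)
    assert isclose(leaky_cascade_pr_e1(ps, pL, c), leaky)
```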

4.2 Additional Engineering Benefits

Causal mechanisms are typically modelled straightforwardly in Bayesian net-


works, as in Fig. 1 (left). The different partitions and decompositions of causal
interaction models proposed in the literature are mostly seen as alternative
representations to support probability elicitation and are hardly ever used in a
network directly.
Table 1 clearly illustrates the reduction in specification size that would be
achieved by choosing a cascading representation for causal mechanisms with
large numbers of cause variables; as this representation limits the number of
parents per effect variable, it also has the potential to reduce the runtime com-
plexity of probabilistic inference, depending on the graphical structure of the
embedding Bayesian network [10,14]. In this section, we now argue that the cas-
cading representation further has clear engineering benefits beyond those widely
recognised.

Clear semantics. Alternative representations of causal interaction models typi-


cally rely on the introduction of additional variables. Although introducing such
additional variables is commonly used for reducing the number of parents for an
effect variable, it is often quite undesirable from a knowledge engineering per-
spective. While the additional variables have a clear meaning from a mathematics
point of view, they often are quite meaningless from the perspective of the appli-
cation domain and thereby hamper the interpretation of the model as a domain

representation. The lack of a clear meaning is especially problematic if the prob-


abilities for these additional variables need to be elicited from experts. Now, in our
cascading representation of a causal interaction model, the additional variables
do have a clear intuitive meaning, as a consequence of the decomposability prop-
erties of the underlying deterministic function: in the cascading representation of
a causal mechanism M(n), any variable Ei can be viewed as the effect variable
in the causal mechanism M(n − i + 1) involving the subset of causes Ci , . . . , Cn .
This claim is readily seen by replacing E1 by Ei in Eq. 4:

    Pr(ēi | c) = Σ_{e− ∈ Ω(E−)} Pr(ēi | ci, ei+1) · Π_{k=i+1}^{n−1} Pr(ek | ck, ek+1) · Pr(en | cn)

where Ω(E−) now is the domain of E− = {Ei+1, . . . , En}, and ek, ck are defined
as before. As each variable Ei in the cascading representation represents the
effect variable in a (leaky) noisy-or model with the cause variables Ci , . . . , Cn , it
has an intuitive meaning that allows for explicit embedding of the representation
in a network without hampering interpretation and probability elicitation.

Maintenance robustness. The cascading representation of a causal interaction


model brings yet another advantage from an engineering perspective. When
using fully expanded probability tables for the effect variables in a Bayesian net-
work, any modelling decision to employ a causal interaction model is no longer
explicitly visible in the network’s representation. More specifically, the depen-
dency of multiple cells of the table on the parameters of the model employed is
hidden. When a network is maintained and adapted to its changing context of
application over a period of years therefore, inopportune changes to the speci-
fied probabilities can disrupt the modelled interaction pattern and, thereby, the
original modelling decision. We illustrate this observation by means of a causal
mechanism with a noisy-or model for the effect variable, and show that the cas-
cading representation of the interaction model used is more robust by preventing
the occurrence of such unintended disruptions.
We address the engineering task of studying the effects, on a network’s output
probabilities, of changing a single probability from one of the network’s prob-
ability tables. Such a sensitivity analysis is usually part of the encompassing
task of fine-tuning the network’s specification to attain a desired effect on the
output (see for example [1–3]). In view of a causal mechanism M(n), we now
consider the output probability of interest Pr(e | ci , ck ), for some 1 ≤ i < k < n,
and address how this probability changes with a change of the probability
x = Pr(e | c̄1 , . . . , c̄i−1 , ci , c̄i+1 , . . . , c̄n ) of the full probability table of the effect
variable E; we note that this probability is one of the parameters of the noisy-or
model. The function [Pr(e | ci , ck )] (x) describing the sensitivity of Pr(e | ci , ck )
to changes in x would be constant if the modelling choice of imposing a noisy-or
interaction for the mechanism at hand is not taken into consideration:


    [Pr(e | ci, ck)](x) = a,  with  a = Σ_{c− ∈ Ω(C−)} Pr(e | ci, ck, c−) · Pr(c− | ci, ck)    (6)
c− ∈Ω(C− )

where Ω(C− ) is the domain of the set C− = {C1 , . . . , Cn } \ {Ci , Ck } of cause


variables for which no value is fixed by the probability of interest. We note
that the computation of Pr(c− | ci , ck ) does not involve any probabilities from
the probability table of E; in contrast, the first term in the product for each
summand in a corresponds directly to a cell from the full table for E. Since the
summation does not involve parameter x directly, the analysis reveals that the
output probability is not sensitive to variations of the parameter. This result
however, does not correctly reflect the true sensitivity of the output probability
to variations in the parameter under study: the parameter x is actually included
in various cells of the full probability table of E by the definitional rule from the
noisy-or model, and thereby hidden in various summands of a.
We now consider the same sensitivity analysis in view of the cascading repre-
sentation of the noisy-or model, for essentially the same probability of interest
and essentially the same parameter probability. Recall that in the cascading rep-
resentation, any posterior probability distribution over the variable E1 equals the
posterior distribution given the same evidence over the original variable E with
the full probability table; we therefore take the probability Pr(e1 | ci , ck ) for the
probability of interest. The parameter pi = Pr(e | c̄1 , . . . , c̄i−1 , ci , c̄i+1 , . . . , c̄n )
of the noisy-or model moreover occurs as pi = Pr(ei | ci , ēi+1 ) in the model’s
cascading representation; we thus take x = Pr(ei | ci , ēi+1 ) as the probability
that will be varied. The sensitivity analysis will in essence establish the same
result as presented in Eq. 6, but now the probabilities Pr(e | ci , ck , c− ) follow
from the cascading representation using Eq. 5, and depend explicitly on x:

    [Pr(ē1 | ci, ck, c−)](x) = [(1 − pi) · (1 − pk) · Π_{j∈Ic−}(1 − pj)](x)
                             = (1 − x) · (1 − pk) · Π_{j∈Ic−}(1 − pj)

where Ic− indexes all present causes in C− and, for ease of exposition, we again
focus on the value ē1 for the variable E1. As a result, we find that
 
    [Pr(ē1 | ci, ck)](x) = Σ_{c− ∈ Ω(C−)} (1 − x) · (1 − pk) · Π_{j∈Ic−}(1 − pj) · Pr(c− | ci, ck)

and conclude, since Pr(e1 | ci, ck) = 1 − Pr(ē1 | ci, ck), that the function
[Pr(e1 | ci, ck)](x) is in fact a linear function of the form a · x + b with
constants a, b, where

    a = (1 − pk) · Σ_{c− ∈ Ω(C−)} Pr(c− | ci, ck) · Π_{j∈Ic−}(1 − pj)
    b = 1 − a

The cascading representation of the noisy-or model performs, during inference,


the computation of the probabilities of the effect e given possible combinations of
causes. That is, application of the definitional rule is in essence left to inference,
resulting in the dependency of the output probability of interest on the noisy-or
parameter now being correctly taken into consideration. This observation fur-
ther demonstrates that, when changing a single parameter of the noisy-or model
specification upon fine-tuning a Bayesian network, in the cascading representa-
tion just a single cell of the conditional probability table for the appropriate effect
variable Ei needs to be adapted; in contrast, in the representation with a full
conditional probability table, various cells that are specified using the model’s
definitional rule will need adaptation. The cascading representation is therefore
easier to adapt without the risk of violating the properties of the underlying
causal interaction model.
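The linearity result is easy to confirm numerically. The sketch below assumes mutually independent causes with illustrative prior probabilities (the priors pr_c and all parameter values are our own additions, not part of the analysis above); it evaluates Pr(e1 | ci, ck) by summing out the remaining causes and checks the closed form a · x + b with b = 1 − a.

```python
from itertools import product
from math import prod, isclose

ps = [0.3, 0.6, 0.2, 0.5]      # noisy-OR parameters p_1 .. p_4 (illustrative)
pr_c = [0.4, 0.7, 0.1, 0.5]    # hypothetical priors Pr(c_j present), causes independent
i, k = 0, 2                    # evidence: causes c_i and c_k observed present
others = [j for j in range(len(ps)) if j not in (i, k)]

def weight(cm):
    # Pr(c_- | c_i, c_k) = Pr(c_-) under independence of the causes.
    return prod(pr_c[j] if v else 1 - pr_c[j] for j, v in zip(others, cm))

def pr_e1(x):
    # Pr(e_1 | c_i, c_k) with the noisy-OR parameter p_i replaced by x.
    params = list(ps)
    params[i] = x
    total = 0.0
    for cm in product((0, 1), repeat=len(others)):   # sum out the unobserved causes
        present = [i, k] + [j for j, v in zip(others, cm) if v]
        total += weight(cm) * (1 - prod(1 - params[j] for j in present))
    return total

# Closed-form slope from the sensitivity analysis; the intercept is b = 1 - a.
a = (1 - ps[k]) * sum(
    weight(cm) * prod(1 - ps[j] for j, v in zip(others, cm) if v)
    for cm in product((0, 1), repeat=len(others)))
for x in (0.0, 0.25, 0.9):
    assert isclose(pr_e1(x), a * x + (1 - a))
```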

5 Conclusions and Further Research


In this paper we revisited part of the large volume of work on causal interaction
models, and focused thereby on the representational complexity of such models.
We built on this early work for the purpose of demonstrating that some of these
models allow for a representation with various elegant properties that have not
been recognised until now. More specifically, by exploiting the property of self-
decomposability of the deterministic function underlying a causal interaction
model, we arrived at an alternative cascading representation that has a clear
intuitive semantics in terms of the causal mechanism itself, not requiring the
inclusion of artificial unobservable variables. In addition to well-known complex-
ity benefits of such alternative representations, this specific cascading representa-
tion has important knowledge engineering benefits, allowing easier maintenance
and more robust fine-tuning of parameters. As the compactness of the cascading
representation can be exploited directly by standard inference algorithms more-
over, we conclude all in all that this representation of causal interaction models
is quite suitable for explicit embedding in Bayesian networks.
While we used the (leaky) noisy-or model for our example causal interaction
model throughout the paper, the presented properties of the cascading represen-
tation apply straightforwardly to any interaction model involving binary-valued
variables and having an underlying self-decomposable deterministic function,
such as the (leaky) noisy-and model. For our further research we aim at extend-
ing our results to causal interaction models involving multi-valued variables, such
as the noisy-max model [5], and to other types of decomposable function.

References
1. Castillo, E., Gutiérrez, J.M., Hadi, A.S.: Sensitivity analysis in discrete Bayesian
networks. IEEE Trans. Syst. Man Cybern. 27, 412–423 (1997)
2. Chan, H., Darwiche, A.: Sensitivity analysis in Bayesian networks: From single
to multiple parameters. In: Halpern, J., Meek, C. (eds.) Proceedings of the 20th
Conference on Uncertainty in Artificial Intelligence, pp. 67–75 (2004)

3. Coupé, V.M.H., van der Gaag, L.C.: Properties of sensitivity analysis of Bayesian
belief networks. Ann. Math. Artif. Intell. 36, 323–356 (2002)
4. Dı́ez, F.J., Druzdzel, M.J.: Canonical Probabilistic Models for Knowledge Engi-
neering. Technical Report CISIAD-06-01 (2007)
5. Dı́ez, F.J., Galán, S.F.: Efficient computation for the noisy max. Int. J. Intell. Syst.
18, 165–177 (2003)
6. Frey, B.J., Patrascu, R., Jaakkola, T., Moran, J.: Sequentially fitting inclusive trees
for inference in noisy-or networks. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.)
Advances in Neural Information Processing Systems 13, pp. 493–499. MIT Press,
Cambridge (2001)
7. Getoor, L.: Learning Statistical Models from Relational Data. PhD Thesis.
Stanford University (2001)
8. Heckerman, D.: A tractable inference algorithm for diagnosing multiple diseases.
In: Henrion, M., Kanal, L., Lemmer, J., Shachter, R. (eds.) Proceedings of the 5th
Conference on Uncertainty in Artificial Intelligence, pp. 163–172 (1989)
9. Heckerman, D.: Causal independence for knowledge acquisition and inference. In:
Heckerman, D., Mamdani, E. (eds.) Proceedings of the 9th Conference on Uncer-
tainty in Artificial Intelligence, pp. 122–127 (1993)
10. Heckerman, D., Breese, J.: Causal independence for probability assessment and
inference using Bayesian networks. IEEE Trans. Syst. Man Cybern. 26, 826–831
(1996)
11. Henrion, M.: Some practical issues in constructing belief networks. In: Kanal, L.N.,
Levitt, T.S., Lemmer, J.F. (eds.) Uncertainty in Artificial Intelligence 3, pp. 161–
173. Elsevier (1989)
12. Huang, K., Henrion, M.: Efficient search-based inference for noisy-or belief net-
works: TopEpsilon. In: Horvitz, E., Jensen, F. (eds.) Proceedings of the 12th Con-
ference on Uncertainty in Artificial Intelligence, pp. 325–331 (1996)
13. Jesus, P., Baquero, C., Almeida, P.S.: A survey of distributed data aggregation
algorithms. IEEE Commun. Surv. Tutorials 17, 381–404 (2011)
14. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Tech-
niques. The MIT Press, Cambridge (2009)
15. Li, W., Poupart, P., van Beek, P.: Exploiting structure in weighted model counting
approaches to probabilistic inference. J. Artif. Intell. Res. 40, 729–765 (2011)
16. Olesen, K.G., et al.: A MUNIN network for the median nerve: a case study on
loops. Appl. Artif. Intell. 3, 385–403 (1989)
17. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, Burlington (1988)
18. del Sagrado, J., Salmerón, A.: Representing canonical models as probability trees.
In: Conejo, R., Urretavizcaya, M., Pérez-de-la-Cruz, J.-L. (eds.) CAEPIA/TTIA
-2003. LNCS (LNAI), vol. 3040, pp. 478–487. Springer, Heidelberg (2004). https://
doi.org/10.1007/978-3-540-25945-9 47
19. Zhang, N.L., Yan, L.: Independence of causal influence and clique tree propagation.
Int. J. Approximate Reasoning 19, 335–349 (1998)
Computational Models for Cumulative
Prospect Theory: Application
to the Knapsack Problem Under Risk

Hugo Martin and Patrice Perny(B)

Sorbonne Université, CNRS, LIP6, 75005 Paris, France


{hugo.martin,patrice.perny}@lip6.fr

Abstract. Cumulative Prospect Theory (CPT) is a well known model


introduced by Kahneman and Tversky in the context of decision mak-
ing under risk to overcome some descriptive limitations of Expected
Utility. In particular CPT makes it possible to account for the fram-
ing effect (outcomes are assessed positively or negatively relatively to a
reference point) and the fact that people often exhibit different risk atti-
tudes towards gains and losses. We study here computational aspects
related to the implementation of CPT for decision making in combi-
natorial domains. More precisely, we consider the Knapsack Problem
under Risk that consists of selecting the “best” subset of alternatives
(investments, projects, candidates) subject to a budget constraint. The
alternatives’ outcomes may be positive or negative (gains or losses) and
are uncertain due to the existence of several possible scenarios of known
probability. Preferences over admissible subsets are based on the CPT
model and we want to determine the CPT-optimal subset for a risk-averse
Decision Maker (DM). The problem requires optimizing a non-linear
function over a combinatorial domain. In the paper we introduce two
distinct computational models based on mixed-integer linear program-
ming to solve the problem. These models are implemented and tested
on randomly generated instances of different sizes to show the practical
efficiency of the proposed approach.

Keywords: Cumulative Prospect Theory · Knapsack Problem · Risk


aversion · Mixed-integer linear programming

1 Introduction

The increasing use of intelligent systems to support human decision-making or to


drive the actions of autonomous artificial agents shows the importance of devel-
oping expressive and adaptable models to support decision making activities in
complex environments. One of the major challenges is to improve our under-
standing and control over AI-based decisions, and also their relevance, fairness,
and alignment with the organisation’s values and risk proneness. In the field of
decision under risk, the main problem to overcome is to compare alternatives
the outcomes of which are known in probabilities, and to provide a control of
risk in the selection of optimal actions.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 52–65, 2019.
https://doi.org/10.1007/978-3-030-35514-2_5
Computational Models for CPT 53

Various mathematical models have been developed in Economics to account
for observed human behaviors in decision making under risk, since the seminal
works of von Neumann and Morgenstern [24] and Savage [19] on the foundations
of Expected Utility Theory (EU). Despite the intuitive appeal of EU theory, sev-
eral experiments have shown that sophisticated rational human behaviors are not
always explainable by EU theory. In particular the experiments conducted by
Kahneman and Tversky [7] have shown that violations of the Von Neumann and
Morgenstern independence axiom or violations of Savage’s Sure Thing Principle
are frequently observed, making it impossible to explain or simulate the observed
behaviors using EU. This has led to alternative models, relying on a deformation
of cumulative probabilities allowing to account for violations of the above men-
tionned independence axioms. For example, Yaari [25] proposed a dual model
to EU, based on a weighting function transforming probabilities rather than
a utility function transforming payoffs. A second example is Rank-dependent
Utility Theory (RDU) where both transformations (probabilities and payoff)
co-exist, thus providing a more general model including EU and Yaari as special
cases. Although these models provide more flexibility to model preferences and
decisions, they are more complex to handle for optimization purposes due to
their non-linearity (w.r.t probabilities and/or payoffs) and their parameters are
more complex to elicit. This issue has been considered in AI, in various topics
such as sequential decision making [5,6], state space search under risk [14], and
incremental preference elicitation [4,15].
Another aspect that is worth considering is that, in the field of decision under
risk, decision makers tend to think of outcomes relative to a certain reference
point (often the status quo). They care generally more about negative outcomes
(i.e. outcomes below the reference point) than positive ones (i.e. outcomes above
the reference point) and may exhibit different attitudes towards gains and losses.
This observation has motivated the development of Prospect Theory [7] and
Cumulative Prospect Theory (CPT) [23] that provide decision models able to
account for this phenomenon. CPT theory includes a sophistication where the
overall utility of a risky prospect is decomposed as the difference between an
aggregate of utilities of positive outcomes and an aggregate of utilities of negative
outcomes. The aggregation operation used for the positive side can be different
from the one used for the negative side, thus letting the possibility to describe
more sophisticated behaviors. Although the theory is well established, the use
of such models for optimization tasks under risk has received less attention.
The aim of this paper is to contribute to fill the gap by proposing compu-
tational models based on CPT for the effective computation of CPT-optimal
solutions on combinatorial domains. For the sake of illustration we will con-
sider the problem of selecting projects under a budget constraint and under risk
(knapsack problem with multiple scenarios).
The paper is organized as follows: In Sect. 2, we briefly survey some related
work. Then, in Sect. 3, we recall some background on CPT and some important
results on modeling strong risk-aversion in CPT. In Sect. 4 we propose a first
linearization for the CPT model, relying on the notion of core of a capacity.
54 H. Martin and P. Perny

This leads us to propose a MIP formulation for the Knapsack problem under
risk. This model is tested on families of instances of different sizes. In Sect. 5 we
consider a special case where the probability weighting functions used in CPT
are piecewise linear with a bounded number of pieces. Under this assumption,
we propose another MIP formulation, more compact and easier to solve, for the
same problem.

2 Related Work
CPT has already been used in AI, e.g., for developing risk-sensitive reinforcement
learning in a traffic signal control application [16]. CPT has also been used in a
number of decision support applications. For example, an application of CPT for
the multi-objective optimization of a bus network is proposed in [9]. However, in
this case study, the set of alternatives is explicitly defined and does not require
optimization techniques.
The Knapsack Problem (KP) under consideration in this paper consists in
selecting a subset of items under a budget constraint. This problem has some
links with the portfolio selection problem that can be seen as the continuous
relaxation of KP under risk. The application of CPT to portfolio selection and
insurance demand have been studied in finance (see e.g. [3]) with a computa-
tional model solvable under some specific assumptions (S-Shaped functions, risk
free reference point and/or linear utility functions). Beside CPT, several LP-
computational measures of dispersion are introduced to control the risk attached
to portfolios: let us mention the mean absolute deviation, the Gini’s mean dif-
ference (GMD) as basic LP computable risk measures, the worst realization
(Minimax) and the Conditional Value-at-Risk (CVaR) as basic LP computable
safety measures [10,11]. Moreover, in the latter reference, computational issues
related to the solution of portfolio models with integrity constraints are investi-
gated and a matheuristic called Kernel Search is proposed. These contributions
do not consider the use of bipolar valuation scales as in CPT.
In multicriteria analysis there is also an increasing interest for modeling dif-
ferent attitudes in the aggregation depending on whether evaluations are on the
positive or negative side. For example, the Choquet integral has been extended
to the bipolar case in [2,8] but optimization aspects attached to general bipo-
lar Choquet integral have not been investigated. Very recently, some LP-solvable
models have been proposed [12] for a subclass of bipolar Choquet integrals named
biOWA (for bipolar ordered weighted average). However, biOWA are symmetric
functions of their arguments and therefore cannot account for decision under risk
when scenarios have different probabilities. Finally, an LP-solvable model was
proposed for a weighted extension of OWA operators [13] but does not consider
the case of bipolar scales. In this paper, we are going to introduce computational
models solvable by mixed-integer linear programming to determine CPT-optimal
solutions in implicit decision spaces.

3 CPT and Strong Risk Aversion


Let us consider a problem of decision making under risk with a finite set of states
of nature N = {s1 , . . . , sn }. The states represent possible scenarios under con-
sideration, impacting differently the outcomes of the alternatives. Let pi denote
the probability of state si . Any feasible alternative is seen as an act in the sense
of Savage. It is therefore characterized by a vector x = (x1 , . . . , xn ) where xi ∈ R
denotes the outcome of x in state si . In this context, the Rank-Dependent Utility
(RDU) model introduced in [17] is defined as follows:

Definition 1. Let x ∈ Rn be the outcome vector of an alternative; the RDU
model is defined by the following rank-dependent expected value:

    fϕu(x) = Σ_{i=1}^{n} [ϕ(Σ_{k=i}^{n} p(k)) − ϕ(Σ_{k=i+1}^{n} p(k))] · u(x(i))    (1)

           = Σ_{i=1}^{n} [u(x(i)) − u(x(i−1))] · ϕ(Σ_{k=i}^{n} p(k))    (2)

with u(x(0)) = 0 by convention, and

where ϕ : [0, 1] → [0, 1] is a non-decreasing probability weighting function, u :


R → R is a non-decreasing real-valued utility function, and (.) is a permutation
defined on N and such that x(1) ≤ x(2) ≤ . . . ≤ x(n) .

Example 1. We consider three different scenarios s = (s1, s2, s3) of probability
p = (1/2, 1/3, 1/6) and we want to select the best solution in the set of alternatives
composed of x = (9, 4, 1), y = (4, 4, 4) and z = (1, 16, 1). We assume that
the preferences of the DM can be represented by RDU with ϕ(p) = p² and
u(x) = √x. We have the following RDU values for the three alternatives:

– fϕu(x) = 1 + (u(4) − u(1)) · ϕ(5/6) + (u(9) − u(4)) · ϕ(1/2) = 1 + 25/36 + 1/4 = 70/36
– fϕu(y) = u(4) + (u(4) − u(4)) · ϕ(5/6) + (u(4) − u(4)) · ϕ(1/2) = u(4) + 0 = 2
– fϕu(z) = u(1) + (u(1) − u(1)) · ϕ(1/2) + (u(16) − u(1)) · ϕ(1/3) = 1 + 3 · 1/9 = 4/3

Thus, we have the following ranking of alternatives: y ≻ x ≻ z, where ≻ is the
preference relation induced by fϕu.
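The computations of Example 1 can be reproduced with a short implementation of Eq. 2, a sketch of our own (not code from the paper) in which utility increments are weighted by ϕ of the tail probabilities:

```python
from math import sqrt, isclose

def rdu(outcomes, probs, u, phi):
    """Rank-dependent utility (Eq. 2): sort outcomes increasingly and weight
    each utility increment u(x_(i)) - u(x_(i-1)) by phi of the tail probability."""
    pairs = sorted(zip(outcomes, probs))
    value, prev_u, tail = 0.0, 0.0, 1.0
    for x, p in pairs:
        value += (u(x) - prev_u) * phi(tail)
        prev_u = u(x)
        tail -= p
    return value

u, phi, p = sqrt, lambda q: q ** 2, [1/2, 1/3, 1/6]
assert isclose(rdu([9, 4, 1], p, u, phi), 70/36)   # f(x)
assert isclose(rdu([4, 4, 4], p, u, phi), 2.0)     # f(y)
assert isclose(rdu([1, 16, 1], p, u, phi), 4/3)    # f(z)
```

With ϕ(p) = p the same function reduces to expected utility, which makes it easy to compare the two models on the same prospects.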

This model clearly generalizes the Expected Utility model that can be
obtained for ϕ(p) = p for all p ∈ [0, 1]. Moreover it also includes the dual
model of EU known as Yaari’s model [25] as a special case (when u is linear).
Nonetheless, this model is not always sufficient to account for decision behaviors
observed when decision makers think of outcomes relative to a certain reference
point. The utility scale is treated as an interval scale and preferences are not
impacted by positive affine transformations. Thus, 0 has no specific status in
the valuation scale, nor any other constant. This may prevent to account for
some sophisticated decision behaviors as illustrated in the following:

Example 2. We look for an optimal path from a source node to a sink node in a
network represented by a directed graph. The arcs of the graph are endowed with
56 H. Martin and P. Perny

vectors representing the algebraic payoff attached to the arc (which can represent
a gain or a loss) under two possible scenarios of equal probability. For example, the
valuation (−2, 3) means that the outcome will be a loss of 2 in scenario 1 and a
gain of 3 in scenario 2. Outcomes are assumed to be additive along a path and we
assume that u(z) = z. This problem can represent several situations (e.g., a path
planning problem or investment planning problem, both under uncertainty).
Let us consider two different instances of this problem, characterized by two
different graphs with nodes {s, a, b, t} and {s , c, d, t } respectively. The graphs
are presented below (Fig. 1).

Fig. 1. Graphs considered in Example 2

On the left-hand side, the upper and lower $s$–$t$ paths have utilities (9, 3) and (5, 5) respectively. We assume here that the DM prefers the former path because she maximizes the expected outcome when all evaluations are positive. In the instance given on the right-hand side, the upper and lower $s'$–$t'$ paths respectively have utilities (−1, −7) and (−5, −5). Here the DM may exhibit a more cautious attitude towards risk due to the presence of negative outcomes. Let us assume that she prefers the latter solution because the outcome in the worst-case scenario is better. Hence, to model these preferences with RDU we must fulfill the following constraints: $f_\varphi(9,3) > f_\varphi(5,5)$ and $f_\varphi(-7,-1) < f_\varphi(-5,-5)$. The former inequality implies that $3 + \varphi(\frac12) \times (9-3) > 5$ and therefore $\varphi(\frac12) > \frac13$. Moreover, the latter inequality implies $-7 + \varphi(\frac12) \times (-1+7) < -5$ and therefore $\varphi(\frac12) < \frac13$, which yields a contradiction. Hence RDU is not able to represent the observed preferences.
To overcome the descriptive limitations illustrated in the above example,
we consider now the Cumulative Prospect Theory model (CPT for short), first
introduced in [7].

Definition 2. Let $x \in \mathbb{R}^n$ be the outcome vector such that $x_{(1)} \le \ldots \le x_{(j-1)} < 0 \le x_{(j)} \le \ldots \le x_{(n)}$ with $j \in \{0, \ldots, n\}$. The Cumulative Prospect Theory model is characterized by the following evaluation function:

$$g_{\varphi,\psi}^u(x) = \sum_{i=1}^{n} w_i\, u(x_{(i)}) \quad\text{with}\quad w_i = \begin{cases} \displaystyle \varphi\Big(\sum_{k=i}^{n} p_{(k)}\Big) - \varphi\Big(\sum_{k=i+1}^{n} p_{(k)}\Big) & \text{if } i \ge j \\[2mm] \displaystyle \psi\Big(\sum_{k=1}^{i} p_{(k)}\Big) - \psi\Big(\sum_{k=1}^{i-1} p_{(k)}\Big) & \text{if } i < j \end{cases} \qquad (3)$$
Computational Models for CPT 57

where $\varphi$ and $\psi$ are two real-valued increasing functions from $[0,1]$ to $[0,1]$ that assign 0 to 0 and 1 to 1, and $u$ is a continuous and increasing real-valued utility function such that $u(0) = 0$ (hence $u(x)$ and $x$ have the same sign).

It can easily be checked that whenever $\varphi(p) = 1 - \psi(1-p)$ for all $p \in [0,1]$ (duality), CPT boils down to RDU. The use of non-dual probability weighting functions $\varphi$ and $\psi$, depending on the sign of the outcomes under consideration, makes it possible to model shifts of behavior relative to the reference point (here 0). Let us come back to Example 2 under the assumption that $u(z) = z$ for all $z \in \mathbb{R}$. We have $g_{\varphi,\psi}(9,3) = [\varphi(1) - \varphi(\frac12)]\,3 + [\varphi(\frac12) - \varphi(0)]\,9 = 3 + 6\varphi(\frac12)$ since $\varphi(0) = 0$ and $\varphi(1) = 1$. Similarly, $g_{\varphi,\psi}(5,5) = [\varphi(1) - \varphi(\frac12)]\,5 + [\varphi(\frac12) - \varphi(0)]\,5 = 5$. Hence $g_{\varphi,\psi}(9,3) > g_{\varphi,\psi}(5,5)$ implies $\varphi(\frac12) > \frac13$ (*).

On the other hand, we have $g_{\varphi,\psi}(-7,-1) = [\psi(\frac12) - \psi(0)](-7) + [\psi(1) - \psi(\frac12)](-1) = -1 - 6\psi(\frac12)$ since $\psi(0) = 0$ and $\psi(1) = 1$. Similarly, $g_{\varphi,\psi}(-5,-5) = -5$. Hence $g_{\varphi,\psi}(-7,-1) < g_{\varphi,\psi}(-5,-5)$ implies $\psi(\frac12) > \frac23$, which does not yield any contradiction. Thus, the DM's preferences can be modeled with $g_{\varphi,\psi}$.

As CPT boils down to RDU when $\varphi(p) = 1 - \psi(1-p)$ for all $p \in [0,1]$, it is interesting to note that under this additional duality constraint, $\psi(\frac12) > \frac23$ implies $\varphi(\frac12) < \frac13$, which is incompatible with the constraint (*) above, derived from $g_{\varphi,\psi}(9,3) > g_{\varphi,\psi}(5,5)$. This again illustrates the fact that RDU is not able to describe such preferences.
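The evaluation function of Definition 2 can be sketched in Python (illustrative code of ours; the weighting functions chosen below are arbitrary picks, not those of the paper):

```python
def cpt(outcomes, probs, phi, psi, u=lambda z: z):
    """CPT value of Eq. (3): decumulative phi-weights for gains,
    cumulative psi-weights for losses (reference point 0). Since the
    ranking puts losses first, testing the sign of x_(i) is equivalent
    to comparing rank i with the threshold rank j of Definition 2."""
    ranked = sorted(zip(outcomes, probs))   # x_(1) <= ... <= x_(n)
    n = len(ranked)
    tails = [sum(p for _, p in ranked[i:]) for i in range(n + 1)]  # sum_{k>=i+1}
    heads = [sum(p for _, p in ranked[:i]) for i in range(n + 1)]  # sum_{k<=i}
    value = 0.0
    for i, (x, _) in enumerate(ranked):
        if x >= 0:   # gain: w_i = phi(pi_i) - phi(pi_{i+1})
            w = phi(tails[i]) - phi(tails[i + 1])
        else:        # loss: w_i = psi(P_{<=i}) - psi(P_{<=i-1})
            w = psi(heads[i + 1]) - psi(heads[i])
        value += w * u(x)
    return value

phi = lambda p: p ** 2      # illustrative convex weighting
psi = lambda p: p ** 0.5    # illustrative concave weighting
# Example 2 revisited: g(9,3) = 3 + 6*phi(1/2), g(-7,-1) = -1 - 6*psi(1/2)
print(cpt((9, 3), (0.5, 0.5), phi, psi))     # 4.5  (= 3 + 6*0.25)
print(cpt((-7, -1), (0.5, 0.5), phi, psi))   # ≈ -5.2426 (= -1 - 6*sqrt(0.5))
```

With these picks, $g(9,3) > g(5,5) = 5$ fails but $g(-7,-1) < g(-5,-5) = -5$ holds; any pair with $\varphi(\frac12) > \frac13$ and $\psi(\frac12) > \frac23$ would satisfy both constraints of Example 2.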
Strong Risk Aversion in CPT. In many situations decision makers are risk-averse. It is therefore useful to further specify CPT for risk-averse agents. We consider here strong risk aversion, which is standardly defined from second-order stochastic dominance. For any random variable $X$, let $G_X$ be the tail distribution defined by $G_X(x) = P(X > x)$, with $P$ a probability function. Let $X, Y$ be two random variables; $X$ stochastically dominates $Y$ at the second order if and only if, for all $x \in \mathbb{R}$, $\int_{-\infty}^{x} G_X(t)\,dt \ge \int_{-\infty}^{x} G_Y(t)\,dt$. From this dominance relation, the concept of mean-preserving spread, standardly used to define risk aversion, can be introduced as follows: $Y$ is said to derive from $X$ by a mean-preserving spread if and only if $E(X) = E(Y)$ and $X$ stochastically dominates $Y$ at the second order. We then have the following definition of strong risk aversion [18]:

Definition 3. Let $\succsim$ be a preference relation. Strong risk aversion holds for $\succsim$ if and only if $X \succsim Y$ for all $X$ and $Y$ such that $Y$ derives from $X$ by a mean-preserving spread.

We recall now the set of conditions that CPT must fulfill to model strong risk aversion. These conditions were first established in [21].

Theorem 1. Strong risk aversion holds in CPT if and only if $\varphi$ is convex, $\psi$ is concave, $u$ is concave for losses as well as for gains, and the following inequality is satisfied:

$$\big[u(x) - u(x - \tfrac{\delta}{q})\big]\,\big[\psi(q+s) - \psi(s)\big] \;\ge\; \big[u(y + \tfrac{\delta}{p}) - u(y)\big]\,\big[\varphi(p+r) - \varphi(r)\big] \qquad (4)$$

for all $x \ge 0 \ge y$, all $\delta > 0$, and all $p, q, r, s$ such that $p + q + r + s \le 1$, $p, q > 0$ and $r, s \ge 0$.

We remark that, when $u(z) = z$ for all $z$, condition (4) can be rewritten in the following simpler form: $\frac{\psi(q+s) - \psi(s)}{q} \ge \frac{\varphi(p+r) - \varphi(r)}{p}$ for all $p, q, r, s$ such that $p + q + r + s \le 1$, $p, q > 0$ and $r, s \ge 0$. In terms of derivatives, this means that $\psi'(s) \ge \varphi'(r)$ for all $r, s \ge 0$ such that $r + s \le 1$.
The above characterization of the admissible forms of CPT for a risk-averse decision maker will be used in the next section to propose computational models for the determination of CPT-optimal solutions on implicit sets. We conclude the present section by making explicit a link between the CPT and RDU models.
Linking RDU and CPT. Interestingly, CPT can be expressed as a difference of two RDU values respectively applied to the positive and negative parts of the outcome vector $x$, using the two distinct probability weighting functions $\varphi$ and $\psi$. This reformulation is well known in the literature on rank-dependent aggregation functions (see e.g., [2]) and reads as follows:

$$g_{\varphi,\psi}^u(x) = f_\varphi^{u^+}(x^+) - f_\psi^{u^-}(x^-) \qquad (5)$$

where $x^+ = \max(x, 0)$ and $x^- = \max(-x, 0)$ componentwise, $u^+(z) = u(z)$ if $z \ge 0$ and 0 otherwise, and $u^-(-z) = -u(z)$ if $z \le 0$ and 0 otherwise. This formulation will be useful in the next sections to propose linear reformulations of the CPT model.
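The decomposition of Eq. (5) can be checked numerically in the case $u(z) = z$ (an illustrative sketch of ours with randomly generated data; `yaari` is the rank-dependent value with identity utility):

```python
import random

def yaari(outcomes, probs, w):
    """Rank-dependent value f_w(x) with u(z) = z (Yaari's model): weight
    each increment x_(i) - x_(i-1) by w(probability of ranks >= i)."""
    value, prev, tail = 0.0, 0.0, 1.0
    for x, p in sorted(zip(outcomes, probs)):
        value += (x - prev) * w(tail)
        prev = x
        tail -= p
    return value

def cpt_direct(outcomes, probs, phi, psi):
    """g_{phi,psi}(x) computed from Definition 2 with u(z) = z."""
    ranked = sorted(zip(outcomes, probs))
    n = len(ranked)
    tails = [sum(p for _, p in ranked[i:]) for i in range(n + 1)]
    heads = [sum(p for _, p in ranked[:i]) for i in range(n + 1)]
    total = 0.0
    for i, (x, _) in enumerate(ranked):
        if x >= 0:
            total += (phi(tails[i]) - phi(tails[i + 1])) * x
        else:
            total += (psi(heads[i + 1]) - psi(heads[i])) * x
    return total

def cpt_split(outcomes, probs, phi, psi):
    """g_{phi,psi}(x) via Eq. (5): f_phi(x+) - f_psi(x-)."""
    xp = [max(x, 0.0) for x in outcomes]
    xm = [max(-x, 0.0) for x in outcomes]
    return yaari(xp, probs, phi) - yaari(xm, probs, psi)

phi = lambda p: p ** 2      # illustrative convex weighting
psi = lambda p: p ** 0.5    # illustrative concave weighting
random.seed(0)
x = [random.uniform(-10, 10) for _ in range(6)]
p = [1 / 6] * 6
print(abs(cpt_direct(x, p, phi, psi) - cpt_split(x, p, phi, psi)) < 1e-9)  # True
```

Both computations agree on any mixed-sign outcome vector, which is the content of Eq. (5).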
The next sections are dedicated to the effective computation of CPT-optimal
solutions on an implicit set of alternatives using linear programming techniques.

4 A First Linearization for CPT Optimization


We present here a first mixed-integer program to maximize the function $g_{\varphi,\psi}^u(x)$ under linear admissibility constraints for a risk-averse agent. By Theorem 1, we know that $\varphi$ must be convex and $\psi$ must be concave to model risk aversion. These properties will be useful to establish a linearization of the CPT model. For simplicity of presentation, we will also assume that $u(x) = x$, and notations like $f_\varphi^u$ and $g_{\varphi,\psi}^u$ will be simplified into $f_\varphi$ and $g_{\varphi,\psi}$. We will briefly explain later how the proposed approach can be extended to the case of a piecewise linear utility $u$. Let us first recall some notions linked to capacities and related concepts.
Capacities are set functions that are well known in decision theory for their
ability to describe non-additive representations of beliefs or importance in deci-
sion models. Let us recall the following:
Definition 4. A set function v : P(N ) → [0, 1] is said to be a capacity if it ver-
ifies: v(∅) = 0 and for all A, B ⊆ N, A ⊆ B ⇒ v(A) ≤ v(B). It is a normalized
capacity if v(N ) = 1.
Among all existing capacities, some are of particular interest. In particular,
a capacity v is said to be:

– convex if v(A ∪ B) + v(A ∩ B) ≥ v(A) + v(B) ∀A, B ⊆ N


– additive if v(A ∪ B) + v(A ∩ B) = v(A) + v(B) ∀A, B ⊆ N

When $v$ is an additive capacity, it can simply be characterized by a vector $(v_1, \ldots, v_n)$ of non-negative weights such that $v(S) = \sum_{i \in S} v_i$ for all $S \subseteq N$. In the sequel we will indifferently use the same notation $v$ for the capacity and for the weighting vector characterizing the capacity.
Let $P$ be any probability measure on $2^N$ ($N$ being the set of scenarios) and $\varphi$ any probability weighting function (continuous, non-decreasing and such that $\varphi(0) = 0$ and $\varphi(1) = 1$); then the set function defined by $v(S) = (\varphi \circ P)(S) = \varphi(\sum_{i \in S} p_i)$ is a capacity. It is well known that $v$ is convex if and only if $\varphi$ is convex [1]. When $v$ is convex, a useful property is that there exists an additive measure $\lambda$ that dominates $v$ [22]. The set of all additive capacities dominating $v$ is known as the core of $v$, formally defined as follows:
Definition 5. The core of a capacity v is the set of all additive capacities domi-
nating v, defined by core(v) = {λ : 2N → [0, 1] additive | λ(S) ≥ v(S) ∀S ⊆ N }.
Hence when $\varphi$ is convex, $v = \varphi \circ P$ has a non-empty core and $v(S) = \min_{\lambda \in \mathrm{core}(v)} \lambda(S)$. In this case, a useful result due to Schmeidler [20], which holds for general Choquet integrals used with a convex capacity, implies that they can be rewritten as the minimum of a set of linear aggregation functions. When applied to $f_\varphi(x)$ (which is an instance of the Choquet integral), the result reads as follows:

Proposition 1. If $\varphi$ is convex, we have $f_\varphi(x) = \min_{\lambda \in \mathrm{core}(\varphi \circ P)} \lambda \cdot x$,

where $f_\varphi$ is Yaari's model obtained from $f_\varphi^u$ when $u(z) = z$ for all $z$. Similarly, for a concave weighting function $\psi$, the dual function defined by $\bar\psi(p) = 1 - \psi(1-p)$ for all $p \in [0,1]$ is convex and has a non-empty core. Hence Proposition 1 can be used again to establish the following result:

Proposition 2. If $\psi$ is concave, we have $f_\psi(x) = \max_{\lambda \in \mathrm{core}(\bar\psi \circ P)} \lambda \cdot x$.

Proof. $f_\psi(x) = -f_{\bar\psi}(-x) = -\min_{\lambda \in \mathrm{core}(\bar\psi \circ P)} \lambda \cdot (-x) = \max_{\lambda \in \mathrm{core}(\bar\psi \circ P)} \lambda \cdot x$. $\square$

Using Propositions 1 and 2 and Eq. (5), we obtain a new formulation of CPT when $\varphi$ is convex and $\psi$ is concave.

Proposition 3. Let $x \in \mathbb{R}^n$. If $\varphi$ is convex and $\psi$ is concave, then we have:

$$g_{\varphi,\psi}(x) = \min_{\lambda \in \mathrm{core}(\varphi \circ P)} \lambda \cdot x^+ \;-\; \max_{\lambda \in \mathrm{core}(\bar\psi \circ P)} \lambda \cdot x^-$$

Now, let us show that this new formulation can be used to optimize $g_{\varphi,\psi}(x)$ using linear programming. From Propositions 1 and 2, the values of $f_\varphi(x)$ and $f_\psi(x)$ for any outcome vector $x \in \mathbb{R}^n$ can be obtained as the solutions of the two following linear programs, respectively:

$$\min \sum_{i=1}^{n} \lambda_i x_i \quad \text{s.t.} \quad \sum_{i \in A} \lambda_i \ge \varphi(P(A)) \;\; \forall A \subseteq N; \qquad \lambda_i \ge 0, \; i = 1, \ldots, n$$

$$\max \sum_{i=1}^{n} \lambda_i x_i \quad \text{s.t.} \quad \sum_{i \in A} \lambda_i \le \psi(P(A)) \;\; \forall A \subseteq N; \qquad \lambda_i \ge 0, \; i = 1, \ldots, n$$

The first LP directly derives from Proposition 1. The second one derives from Proposition 2 after observing that the constraints $\sum_{i \in B} \lambda_i \ge \bar\psi(P(B))$, $\forall B \subseteq N$, are equivalent to $\sum_{i \in A} \lambda_i \le \psi(P(A))$, $\forall A \subseteq N$ (by setting $A = N \setminus B$). Now, if we consider $x$ as a variable vector, we use the dual formulations of the above LPs to get rid of the quadratic terms:

$$\max \sum_{A \subseteq N} \varphi(P(A))\, d_A \quad \text{s.t.} \quad \sum_{A \subseteq N : i \in A} d_A \le x_i, \; i = 1, \ldots, n; \qquad d_A \ge 0 \;\; \forall A \subseteq N$$

$$\min \sum_{A \subseteq N} \psi(P(A))\, d_A \quad \text{s.t.} \quad \sum_{A \subseteq N : i \in A} d_A \ge x_i, \; i = 1, \ldots, n; \qquad d_A \ge 0 \;\; \forall A \subseteq N$$

Finally, we obtain program $P_1$ given below to optimize $g_{\varphi,\psi}$, under the assumptions that $\varphi$ is convex, $\psi$ is concave and $u(x) = x$:

$$(P_1)\qquad \max \; \sum_{A \subseteq N} \varphi(P(A))\, d_A^+ \;-\; \sum_{A \subseteq N} \psi(P(A))\, d_A^-$$

subject to:

$$\sum_{A \subseteq N : i \in A} d_A^+ \le x_i^+, \qquad \sum_{A \subseteq N : i \in A} d_A^- \ge x_i^-, \qquad i = 1, \ldots, n$$
$$x_i = x_i^+ - x_i^-, \qquad 0 \le x_i^+ \le z_i \times M, \qquad 0 \le x_i^- \le (1 - z_i) \times M, \qquad i = 1, \ldots, n$$
$$x \in X; \qquad x_i^+, x_i^-, d_A^+, d_A^- \ge 0, \; i = 1, \ldots, n, \; \forall A \subseteq N; \qquad z_i \in \{0, 1\}, \; i = 1, \ldots, n$$

The integer variables $z_i$, $i = 1, \ldots, n$, are used to decide whether $x_i$ is positive or not. The constant $M$ is used as usual to model disjunctive constraints depending on the sign of $x_i$. $P_1$ has $2^{n+1}$ continuous variables $d_A^\pm$, $n$ binary variables and $5n$ constraints. It can be specialized to solve any CPT-optimization problem by inserting the variables and constraints needed to define the set $X$. For example, to solve the knapsack problem under risk, we insert $m$ boolean variables $y_j$ (set to 1 iff object $j$ is selected) subject to the constraint $\sum_{j=1}^{m} w_j y_j \le C$, for weights $w_j$, $j = 1, \ldots, m$, and the knapsack capacity $C$. Variables $x_i$ are then linked to variables $y_j$ by equations of the form $x_i = \sum_{j=1}^{m} u_{ij} y_j$, defining $x_i$ as a linear utility over the selected set of objects for any scenario $i \in \{1, \ldots, n\}$.
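To make the optimization problem concrete, a toy CPT-knapsack instance can be solved by brute-force enumeration (an illustrative sketch only — the instance data and the enumeration approach are ours, not the MIP of the paper, and do not scale beyond small $m$):

```python
from itertools import product

def cpt_value(outcomes, probs, phi, psi):
    """g_{phi,psi}(x) with u(z) = z, as in Definition 2."""
    ranked = sorted(zip(outcomes, probs))
    n = len(ranked)
    tails = [sum(p for _, p in ranked[i:]) for i in range(n + 1)]
    heads = [sum(p for _, p in ranked[:i]) for i in range(n + 1)]
    g = 0.0
    for i, (x, _) in enumerate(ranked):
        w = phi(tails[i]) - phi(tails[i + 1]) if x >= 0 else \
            psi(heads[i + 1]) - psi(heads[i])
        g += w * x
    return g

def cpt_knapsack_bruteforce(util, weights, capacity, probs, phi, psi):
    """Enumerate all selections y in {0,1}^m, keep the feasible one
    maximizing the CPT value of the scenario-wise linear utilities."""
    m, n = len(weights), len(probs)
    best = (float("-inf"), None)
    for y in product((0, 1), repeat=m):
        if sum(w * yj for w, yj in zip(weights, y)) > capacity:
            continue
        x = [sum(util[i][j] * y[j] for j in range(m)) for i in range(n)]
        best = max(best, (cpt_value(x, probs, phi, psi), y))
    return best

# Made-up instance: 3 objects, 2 equiprobable scenarios.
util = [[4, -2, 3],    # scenario 1 utilities u_{1j}
        [-1, 5, 2]]    # scenario 2 utilities u_{2j}
best_value, best_y = cpt_knapsack_bruteforce(
    util, weights=[2, 3, 2], capacity=4, probs=[0.5, 0.5],
    phi=lambda p: p ** 2, psi=lambda p: p ** 0.5)
print(best_y, best_value)   # (1, 0, 1) 2.5
```

On this instance, selecting objects 1 and 3 yields outcomes $(7, 1)$ and the best CPT value 2.5; the MIP $P_1$ finds the same optimum without enumerating the $2^m$ selections.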
We implemented the above model using the Gurobi 7.5.2 solver on a computer with 12 GB of RAM and an Intel(R) Core(TM) i7 CPU 950 @ 3.07 GHz processor. Table 1 gives the results obtained for the CPT-knapsack problem, modeled as follows: $m$ represents the number of objects and $n$ the number of scenarios; utilities $u_{ij}$ and weights $w_j$ were randomly generated in the range $[-10, 10]$ (resp. $[-100, 100]$); the capacity is set to $C = (\sum_{j=1}^{m} w_j)/2$; $\varphi$ and $\psi$ are randomly drawn so as to satisfy the conditions of Proposition 1. Average times given in Table 1 are computed over 20 runs, with a timeout set to 1200 s.

Table 1. Times (s) obtained by MIP P1 for the CPT-knapsack

m     n = 3   n = 5   n = 7
100   0.03    0.21    0.67
500   0.05    1.31    45.60
750   0.08    0.87    125.72
1000  0.13    3.28    150.48

We observe that this computational model is able to solve instances with a large number of objects in a few seconds. Nonetheless, it has an exponential number of continuous variables, which may limit its applicability when the number of scenarios becomes larger. To overcome this limitation, we will now present a second computational model with a polynomial number of variables and constraints, which optimizes $g_{\varphi,\psi}(x)$ under some additional assumptions concerning $\varphi$ and $\psi$.

5 The Case of Piecewise Linear Weighting Functions


From now on, we assume that $\varphi$ and $\psi$ are piecewise-linear functions with breakpoints $0 = \alpha_0 \le \alpha_1 \le \alpha_2 \le \ldots \le \alpha_t = 1$ and $0 = \beta_0 \le \beta_1 \le \beta_2 \le \ldots \le \beta_t = 1$ respectively. This assumption is often made in different contexts of elicitation and optimization. For example, Ogryczak [13] uses a similar assumption to propose an efficient linearization of the WOWA operator. We will follow a similar idea to propose a linearization for CPT.

A piecewise-linear function has a constant derivative on each interval. Thus we define $\varphi'(u) = d_i^+$ for all $u \in [\alpha_{i-1}, \alpha_i]$ and $\psi'(u) = d_i^-$ for all $u \in [\beta_{i-1}, \beta_i]$. Moreover, we assume that $d_{t+1}^+ = 0$ and $d_{t+1}^- = 0$ for convenience. For any given solution $x$, we define the cumulative function $F_x$, for all $\alpha$, by:

$$F_x(\alpha) = \sum_{i=1}^{n} p_i\, \delta_i(\alpha) \quad \text{with} \quad \delta_i(\alpha) = \begin{cases} 1 & \text{if } x_i \le \alpha \\ 0 & \text{otherwise} \end{cases}$$

Then $F_x^{(-1)}(u) = \inf\{y : F_x(y) \ge u\}$ returns the minimum performance $y$ such that the probability of scenarios whose performance is lower than or equal to $y$ is greater than or equal to $u$. Then, we define the tail function $G_x$, for all $\alpha$, by:

$$G_x(\alpha) = \sum_{i=1}^{n} p_i\, \delta_i(\alpha) \quad \text{with} \quad \delta_i(\alpha) = \begin{cases} 1 & \text{if } x_i > \alpha \\ 0 & \text{otherwise} \end{cases}$$

and $G_x^{(-1)}(u) = \inf\{y : G_x(y) \le u\}$ returns the minimum performance $y$ such that the probability of scenarios whose performance level is greater than $y$ is lower than or equal to $u$. First, we observe that the following relation holds between $G_x^{(-1)}$ and $F_x^{(-1)}$.

Proposition 4. For all $x \in \mathbb{R}^n$ and $u \in [0,1]$, $G_x^{(-1)}(u) = F_x^{(-1)}(1-u)$.

Proof. According to the definitions of $F_x$ and $G_x$, we have $G_x(u) = 1 - F_x(u)$. We then have the following result: $F_x^{(-1)}(1-u) = \inf\{y : F_x(y) \ge 1-u\} = \inf\{y : 1 - F_x(y) \le u\} = \inf\{y : G_x(y) \le u\} = G_x^{(-1)}(u)$. $\square$
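Proposition 4 can be checked numerically for step distributions (an illustrative sketch with made-up data; for a step function, the infimum is attained at one of the outcome values, so we search over that finite set, and rationals avoid rounding issues):

```python
from fractions import Fraction

def F(x, p, y):
    """Cumulative function F_x(y): probability that the outcome is <= y."""
    return sum(pi for xi, pi in zip(x, p) if xi <= y)

def G(x, p, y):
    """Tail function G_x(y): probability that the outcome is > y (G = 1 - F)."""
    return sum(pi for xi, pi in zip(x, p) if xi > y)

def F_inv(x, p, u):
    """F_x^{(-1)}(u) = inf{y : F_x(y) >= u}."""
    return min(y for y in x if F(x, p, y) >= u)

def G_inv(x, p, u):
    """G_x^{(-1)}(u) = inf{y : G_x(y) <= u}."""
    return min(y for y in x if G(x, p, y) <= u)

# Exact check of Proposition 4 on a grid of levels u in (0, 1).
x = [1, 4, 9]
p = [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]
assert all(G_inv(x, p, Fraction(k, 12)) == F_inv(x, p, 1 - Fraction(k, 12))
           for k in range(1, 12))
print("G_inv(u) == F_inv(1 - u) on the test grid")
```

The equality is exact here because both infima are taken over the same candidate set with equivalent predicates ($F(y) \ge 1-u \Leftrightarrow G(y) \le u$).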

Then, let us show that these notions allow a new formulation of $g_{\varphi,\psi}$:

Proposition 5.

$$g_{\varphi,\psi}(x) = \sum_{i=1}^{t} \Big[ (d_{i+1}^+ - d_i^+) \int_0^{1-\alpha_i} F_{x^+}^{(-1)}(v)\,dv \;-\; (d_i^- - d_{i+1}^-) \int_0^{\beta_i} G_{x^-}^{(-1)}(v)\,dv \Big] \qquad (6)$$

Proof. Let $(\cdot)$ be a permutation of scenarios such that $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$ and $\pi_i = \sum_{k=i}^{n} p_{(k)}$. Let $E(x) = \int_0^1 \big[G_{x^+}^{(-1)}(u)\,\varphi'(u) - G_{x^-}^{(-1)}(u)\,\psi'(u)\big]\,du$. First, we show that $E(x) = g_{\varphi,\psi}(x)$:

$$E(x) = \int_0^1 \big[G_{x^+}^{(-1)}(u)\,\varphi'(u) - G_{x^-}^{(-1)}(u)\,\psi'(u)\big]\,du = \sum_{i=1}^{n} \int_{\pi_{i+1}}^{\pi_i} G_{x^+}^{(-1)}(u)\,\varphi'(u)\,du - \sum_{i=1}^{n} \int_{\pi_{i+1}}^{\pi_i} G_{x^-}^{(-1)}(u)\,\psi'(u)\,du$$

with $\pi_{n+1} = 0$. We notice that $G_{x^+}^{(-1)}(u) = x_{(i)}^+$ for all $u \in [\pi_{i+1}, \pi_i]$. We have:

$$E(x) = \sum_{i=1}^{n} x_{(i)}^+ \int_{\pi_{i+1}}^{\pi_i} \varphi'(u)\,du - \sum_{i=1}^{n} x_{(i)}^- \int_{\pi_{i+1}}^{\pi_i} \psi'(u)\,du$$
$$= \sum_{i=1}^{n} x_{(i)}^+ \Big[\varphi\Big(\sum_{k=i}^{n} p_{(k)}\Big) - \varphi\Big(\sum_{k=i+1}^{n} p_{(k)}\Big)\Big] - \sum_{i=1}^{n} x_{(i)}^- \Big[\psi\Big(\sum_{k=i}^{n} p_{(k)}\Big) - \psi\Big(\sum_{k=i+1}^{n} p_{(k)}\Big)\Big] = g_{\varphi,\psi}(x)$$

Then, the desired result can be obtained from another formulation of $E(x)$:

$$E(x) = \sum_{i=1}^{t} \Big[\int_{\alpha_{i-1}}^{\alpha_i} G_{x^+}^{(-1)}(u)\,\varphi'(u)\,du - \int_{\beta_{i-1}}^{\beta_i} G_{x^-}^{(-1)}(u)\,\psi'(u)\,du\Big]$$

We recall that $\varphi'(u) = d_i^+$ for all $u \in [\alpha_{i-1}, \alpha_i]$ and $\psi'(u) = d_i^-$ for all $u \in [\beta_{i-1}, \beta_i]$ (with $d_{t+1}^+ = d_{t+1}^- = 0$ for convenience). We have:

$$E(x) = \sum_{i=1}^{t} \Big[d_i^+ \int_{\alpha_{i-1}}^{\alpha_i} G_{x^+}^{(-1)}(u)\,du - d_i^- \int_{\beta_{i-1}}^{\beta_i} G_{x^-}^{(-1)}(u)\,du\Big]$$
$$= \sum_{i=1}^{t} \Big[d_i^+ \int_{\alpha_{i-1}}^{\alpha_i} F_{x^+}^{(-1)}(1-u)\,du - d_i^- \int_{\beta_{i-1}}^{\beta_i} G_{x^-}^{(-1)}(u)\,du\Big] \quad \text{(by Prop. 4)}$$
$$= \sum_{i=1}^{t} \Big[d_i^+ \int_{1-\alpha_i}^{1-\alpha_{i-1}} F_{x^+}^{(-1)}(v)\,dv - d_i^- \int_{\beta_{i-1}}^{\beta_i} G_{x^-}^{(-1)}(u)\,du\Big] \quad \text{(with } v = 1-u\text{)}$$
$$= \sum_{i=1}^{t} \Big[d_i^+ \Big(\int_0^{1-\alpha_{i-1}} F_{x^+}^{(-1)}(v)\,dv - \int_0^{1-\alpha_i} F_{x^+}^{(-1)}(v)\,dv\Big) - d_i^- \int_{\beta_{i-1}}^{\beta_i} G_{x^-}^{(-1)}(u)\,du\Big]$$
$$= \sum_{i=1}^{t} \Big[(d_{i+1}^+ - d_i^+) \int_0^{1-\alpha_i} F_{x^+}^{(-1)}(v)\,dv - d_i^- \Big(\int_0^{\beta_i} G_{x^-}^{(-1)}(u)\,du - \int_0^{\beta_{i-1}} G_{x^-}^{(-1)}(u)\,du\Big)\Big]$$
$$= \sum_{i=1}^{t} \Big[(d_{i+1}^+ - d_i^+) \int_0^{1-\alpha_i} F_{x^+}^{(-1)}(v)\,dv - (d_i^- - d_{i+1}^-) \int_0^{\beta_i} G_{x^-}^{(-1)}(v)\,dv\Big] \qquad \square$$

Now we introduce the two following linear programs to compute $\int_0^{1-\alpha_k} F_x^{(-1)}(v)\,dv$ and $\int_0^{\alpha_k} G_x^{(-1)}(v)\,dv$, for a fixed $x$ and $k$. The linearization of $\int_0^{p} F_x^{(-1)}(v)\,dv$ was first proposed in [13] and is here extended to $\int_0^{p} G_x^{(-1)}(v)\,dv$:

$$\min \sum_{i=1}^{n} x_i m_i \quad \text{s.t.} \quad \sum_{i=1}^{n} m_i = 1 - \alpha_k; \qquad m_i \le p_i,\; m_i \ge 0, \; i = 1, \ldots, n$$

$$\max \sum_{i=1}^{n} x_i m_i \quad \text{s.t.} \quad \sum_{i=1}^{n} m_i = \alpha_k; \qquad m_i \le p_i,\; m_i \ge 0, \; i = 1, \ldots, n$$

Then we consider their respective dual formulations:

$$\max \; (1-\alpha_k)\, r - \sum_{i=1}^{n} p_i b_i \quad \text{s.t.} \quad r - b_i \le x_i,\; b_i \ge 0, \; i = 1, \ldots, n$$

$$\min \; \alpha_k\, r + \sum_{i=1}^{n} p_i b_i \quad \text{s.t.} \quad r + b_i \ge x_i,\; b_i \ge 0, \; i = 1, \ldots, n$$
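To see why the minimization LP computes the quantile integral, one can compare, on a toy instance of ours, its greedy optimal solution (which fills probability mass on the smallest outcomes first) with a direct computation of $\int_0^{q} F_x^{(-1)}(v)\,dv$:

```python
def lp_greedy(x, p, q):
    """Optimal value of: min sum x_i m_i  s.t.  sum m_i = q, 0 <= m_i <= p_i.
    For this transportation-like LP, filling the cheapest outcomes first
    is optimal (simple exchange argument)."""
    total, left = 0.0, q
    for xi, pi in sorted(zip(x, p)):
        m = min(pi, left)
        total += xi * m
        left -= m
    return total

def quantile_integral(x, p, q):
    """Direct computation of the integral of F_x^{(-1)} over [0, q]:
    the quantile function equals x_(i) on a slice of width p_(i)."""
    total, v = 0.0, 0.0
    for xi, pi in sorted(zip(x, p)):
        total += xi * max(0.0, min(v + pi, q) - v)
        v += pi
    return total

x, p = [1, 4, 9], [1/6, 1/3, 1/2]
for q in (1/6, 1/2, 5/6, 1.0):
    assert abs(lp_greedy(x, p, q) - quantile_integral(x, p, q)) < 1e-12
print(lp_greedy(x, p, 0.5))   # ≈ 1.5
```

The maximization LP with bound $\alpha_k$ is symmetric: it greedily fills mass on the largest outcomes, recovering the tail-quantile integral.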
Using these formulations, we propose a mixed integer program $(P_2)$ to maximize $g_{\varphi,\psi}(x)$ for any $x$ belonging to a set $X$:

$$(P_2)\qquad \max \; \sum_{k=1}^{t} d_k'^{+} \Big( (1-\alpha_k)\, r_k^+ - \sum_{l=1}^{n} p_l\, b_{lk}^+ \Big) \;-\; \sum_{k=1}^{t} d_k'^{-} \Big( \alpha_k\, r_k^- + \sum_{l=1}^{n} p_l\, b_{lk}^- \Big)$$

subject to:

$$r_k^+ - b_{ik}^+ \le x_i^+, \qquad r_k^- + b_{ik}^- \ge x_i^-, \qquad i = 1, \ldots, n, \; k = 1, \ldots, t$$
$$x_i = x_i^+ - x_i^-, \qquad 0 \le x_i^+ \le z_i \times M, \qquad 0 \le x_i^- \le (1 - z_i) \times M, \qquad i = 1, \ldots, n$$
$$x \in X; \qquad x_i^+, x_i^-, b_{ik}^+, b_{ik}^- \ge 0; \qquad z_i \in \{0, 1\}, \; i = 1, \ldots, n, \; k = 1, \ldots, t$$

with $d_k'^{+} = d_{k+1}^+ - d_k^+$ and $d_k'^{-} = d_k^- - d_{k+1}^-$ for all $k = 1, \ldots, t$. The integer variables $z_i$, $i = 1, \ldots, n$, are used to decide whether $x_i$ is positive or not. The constant $M$ is used as usual to model disjunctive constraints depending on the sign of $x_i$. $P_2$ contains $2nt + 3n$ constraints, $n$ binary variables and $2nt + 2n + 2t$ continuous variables. It can be specialized to solve any CPT-optimization problem by inserting the variables and constraints needed to define the set $X$, as shown for $P_1$. Table 2 gives the results obtained for the CPT-optimal knapsack problem. Functions $\varphi$ and $\psi$ are chosen piecewise linear with $n$ breakpoints; these functions are randomly drawn so as to satisfy the conditions of Proposition 1. Average times given in Table 2 are computed over 20 runs, with a timeout set to 1200 s.

Table 2. Times (s) obtained by MIP P2 for the CPT-knapsack

m n = 3 n = 5 n = 7 n = 10
100 0.01 0.03 0.07 0.12
500 0.04 0.13 0.19 28.22
750 0.03 0.18 2.76 107.36
1000 0.04 0.27 9.027 191.84

The linearization presented here for the case where $u(z) = z$ for all $z$ can easily be extended to deal with piecewise linear concave utility functions $u$ for gains and for losses (admitting a bounded number of pieces). In this case, the utility function can indeed be defined on gains as the minimum of a finite set of linear utilities, which enables a linear reformulation (the same holds for losses). Note also that having a concave utility over gains and over losses is consistent with the risk-averse attitude under consideration in this paper.

6 Conclusion
CPT is a well-known model in the context of decision making under risk, used to overcome some descriptive limitations of both EU and RDU. In this paper, we have proposed two mixed integer programs for the search of CPT-optimal solutions on implicit sets of alternatives. We tested these computational models on randomly generated instances of the knapsack problem involving up to 1000 objects and 10 scenarios. The second MIP formulation performs significantly better, due to the additional restriction to piecewise linear weighting functions.

A natural extension of this work could be to address the exponential size of our first formulation with a Branch&Price approach. Another natural extension could be to propose a similar approach for a general bipolar Choquet integral, where the capacity is not necessarily defined as a weighted probability. It can easily be shown that the first linearization proposed in the paper still applies to bipolar Choquet integrals.

References
1. Chateauneuf, A.: On the use of capacities in modeling uncertainty aversion and
risk aversion. J. Math. Econ. 20(4), 343–369 (1991)
2. Grabisch, M., Marichal, J.L., Mesiar, R., Pap, E.: Aggregation Functions, vol. 127.
Cambridge University Press, Cambridge (2009)

3. He, X.D., Zhou, X.Y.: Portfolio choice under cumulative prospect theory: an ana-
lytical treatment. Manag. Sci. 57(2), 315–331 (2011)
4. Hines, G., Larson, K.: Preference elicitation for risky prospects. In: Proceedings of
the 9th International Conference on Autonomous Agents and Multiagent Systems:
volume 1, vol. 1, pp. 889–896. International Foundation for Autonomous Agents
and Multiagent Systems (2010)
5. Jaffray, J., Nielsen, T.: An operational approach to rational decision making based
on rank dependent utility. Eur. J. Oper. Res. 169(1), 226–246 (2006)
6. Jeantet, G., Spanjaard, O.: Computing rank dependent utility in graphical models
for sequential decision problems. Artif. Intell. 175(7–8), 1366–1389 (2011)
7. Kahneman, D., Tversky, A.: Prospect theory: an analysis of decision under risk.
Econometrica 47(2), 263–292 (1979)
8. Labreuche, C., Grabisch, M.: Generalized Choquet-like aggregation functions for handling bipolar scales. Eur. J. Oper. Res. 172(3), 931–955 (2006)
9. Li, X., Wang, W., Xu, C., Li, Z., Wang, B.: Multi-objective optimization of urban
bus network using cumulative prospect theory. J. Syst. Sci. Complex. 28(3), 661–
678 (2015)
10. Mansini, R., Ogryczak, W., Speranza, M.G.: Twenty years of linear programming
based portfolio optimization. Eur. J. Oper. Res. 234(2), 518–535 (2014)
11. Mansini, R., Ogryczak, W., Speranza, M.G.: Linear and Mixed Integer Program-
ming for Portfolio Optimization. EATOR. Springer, Cham (2015). https://doi.org/
10.1007/978-3-319-18482-1
12. Martin, H., Perny, P.: BIOWA for preference aggregation with bipolar scales: application to fair optimization in combinatorial domains. In: IJCAI (2019)
13. Ogryczak, W., Śliwiński, T.: On efficient WOWA optimization for decision support under risk. Int. J. Approximate Reasoning 50(6), 915–928 (2009)
14. Perny, P., Spanjaard, O., Storme, L.X.: State space search for risk-averse agents.
In: IJCAI, pp. 2353–2358 (2007)
15. Perny, P., Viappiani, P., Boukhatem, A.: Incremental preference elicitation for
decision making under risk with the rank-dependent utility model. In: Proceedings
of Uncertainty in Artificial Intelligence (2016)
16. Prashanth, L., Jie, C., Fu, M., Marcus, S., Szepesvári, C.: Cumulative prospect
theory meets reinforcement learning: prediction and control. In: International Con-
ference on Machine Learning, pp. 1406–1415 (2016)
17. Quiggin, J.: Generalized Expected Utility Theory - The Rank-dependent Model.
Kluwer Academic Publisher, Dordrecht (1993)
18. Rothschild, M., Stiglitz, J.E.: Increasing risk: I. A definition. J. Econ. Theory 2(3),
225–243 (1970)
19. Savage, L.J.: The Foundations of Statistics. John Wiley and Sons, New York (1954)
20. Schmeidler, D.: Integral representation without additivity. Proc. Am. Math. Soc.
97(2), 255–261 (1986)
21. Schmidt, U., Zank, H.: Risk aversion in cumulative prospect theory. Manag. Sci.
54(1), 208–216 (2008)
22. Shapley, L.: Cores of convex games. Int. J. Game Theory 1, 11–22 (1971)
23. Tversky, A., Kahneman, D.: Advances in prospect theory: cumulative representa-
tion of uncertainty. J. Risk Uncertainty 5(4), 297–323 (1992)
24. Von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior,
2nd edn. Princeton University Press, Princeton (1947)
25. Yaari, M.: The dual theory of choice under risk. Econometrica 55, 95–115 (1987)
On a New Evidential C-Means Algorithm
with Instance-Level Constraints

Jiarui Xie1,2(B) and Violaine Antoine1


1
Clermont Auvergne University, UMR 6158 CNRS, LIMOS,
63000 Clermont-Ferrand, France
[email protected], [email protected]
2
School of Computer Science and Technology, Harbin Institute of Technology,
Harbin 150001, People’s Republic of China

Abstract. Clustering is an unsupervised task whose performance can be highly improved with background knowledge. As a consequence, several semi-supervised clustering approaches have been proposed to integrate prior information in the form of constraints, generally at the instance level. Amongst them, evidential semi-supervised clustering algorithms, such as the CECM or SECM algorithm, rely on the theoretical foundation of belief functions, which extends probability theory and allows us to express many types of uncertainty about the assignment of an object to a cluster. In this framework, no evidential clustering algorithm has ever mixed different types of instance-level constraints. We propose here to combine pairwise constraints and labeled data constraints in order to better retrieve information from the background knowledge. The new algorithm, called LPECM, shows good performances on synthetic and real data sets.

Keywords: Labeled data constraints · Pairwise constraints ·


Instance-level constraints · Belief function · Evidential clustering ·
Semi-supervised clustering

1 Introduction

Clustering is a classical data analysis method that aims at creating natural


groups from a set of objects by assigning similar objects into the same cluster
while separating dissimilar objects into different clusters. Clustering solutions
can be expressed in the form of a partition. Amongst partitional clustering meth-
ods, some produce hard [6,18], fuzzy [10,19] and credal partitions [2–4,14]. A
hard partition assigns an object to a cluster with total certainty whereas a fuzzy
partition allows us to represent the class membership of an object in the form
of a probabilistic distribution. The credal partition, developed in the framework
of belief function theory, extends the concepts of hard and fuzzy partition. It
makes possible the representation of both uncertainty and imprecision regarding
the class membership of an object.
Clustering is a challenging task since various clustering solutions can be
valid although distinct. In order to lead clustering methods towards a specific
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 66–78, 2019.
https://doi.org/10.1007/978-3-030-35514-2_6

and desired solution, semi-supervised clustering algorithms integrate background knowledge, generally in the form of instance-level constraints. In [2,3,19], labeled data constraints are taken into account to improve the performance of the clustering. In [4,6,10,18], two less informative kinds of constraints are introduced: the must-link constraint, which specifies that two objects have to be in the same cluster, and the cannot-link constraint, which indicates that two objects should not be assigned to the same cluster.

Combining the three types of instance-level constraints can help to retrieve as much information as possible and thus achieve better performances. However, there currently exist very few methods able to deal with such constraints [17]; in particular, none generates a credal partition. In this paper, we propose to associate two evidential semi-supervised clustering algorithms, the first one handling pairwise constraints and the second one dealing with labeled data constraints. The goal is to create a more general algorithm that can exploit a larger number of constraints from the background knowledge and that generates a credal partition.
The rest of the paper is organized as follows. Section 2 recalls the neces-
sary backgrounds about belief function, credal partition and evidential clustering
algorithms. Section 3 introduces the new algorithm named LPECM and presents
the objective function as well as the optimization steps. Several experiments are
produced in Sect. 4. Finally, Sect. 5 makes a conclusion about the work.

2 Background

2.1 Belief Function and Credal Partition

Evidence theory [15] (or belief function theory) is a mathematical framework


that makes it possible to represent states of partial and uncertain knowledge. Let X
be a data set composed of n objects such that xi ∈ Rp corresponds to the ith
object. Let Ω = {ω1 , . . . , ωc } be the set of possible clusters. The mass function
$m_i : 2^\Omega \to [0,1]$ applied to the instance $x_i$ measures the degree of belief $m_{ik} = m_i(A_k)$ that the real class of $x_i$ belongs to the subset $A_k \subseteq \Omega$. It satisfies:

$$\sum_{A_k \subseteq \Omega} m_{ik} = 1. \qquad (1)$$

The collection M = [m1 , . . . , mn ] such that mi = (mik ) forms a credal partition


that is a generalization of a fuzzy partition. Indeed, any subset Ak such that
mik > 0 is named a focal set of mi . When all focal elements are singletons, the
mass function is equivalent to a probability distribution. If such situation occurs
for all objects, the credal partition M can be seen as a fuzzy partition.
Several transformations of a mass function $m_i$ are possible in order to extract particular information. The plausibility function $pl : 2^\Omega \to [0,1]$ defined in Eq. (2) corresponds to the maximal degree of belief that could be given to a subset $A$:

$$pl(A) = \sum_{A_k \cap A \ne \emptyset} m(A_k), \quad \forall A \subseteq \Omega. \qquad (2)$$
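These notions can be illustrated with a small sketch (an example of ours, with a made-up mass function on a three-cluster frame):

```python
# A made-up mass function for one object over Omega = {w1, w2, w3}:
# some belief on {w1}, some imprecision on {w1, w2}, the rest on Omega.
m = {frozenset({"w1"}): 0.5,
     frozenset({"w1", "w2"}): 0.3,
     frozenset({"w1", "w2", "w3"}): 0.2}
assert abs(sum(m.values()) - 1.0) < 1e-12       # normalization, Eq. (1)

def pl(m, A):
    """Plausibility of Eq. (2): total mass of the focal sets meeting A."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B & A)

print(pl(m, {"w1"}))   # 1.0 : every focal set intersects {w1}
print(pl(m, {"w2"}))   # 0.5 : masses of {w1, w2} and Omega
print(pl(m, {"w3"}))   # 0.2 : only Omega intersects {w3}
```

The example also shows the generalization at work: if all focal sets were singletons, `m` would reduce to a probability distribution and `pl` to its probabilities.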

To make a decision, a mass function can also be transformed into a pignistic probability distribution [16]. Finally, a hard credal partition can be obtained by assigning each object to the subset of clusters with the highest mass. This allows us to easily detect objects located in ambiguous regions.

2.2 Evidential C-Means Algorithm


Evidential C-Means (ECM) [14] is the credibilistic version of the Fuzzy C-Means algorithm (FCM) [5]. In the FCM algorithm, each cluster is represented by a point called centroid or prototype. The ECM algorithm, which generates a credal partition, generalizes the cluster representation by considering a centroid $v_k \in \mathbb{R}^p$ for each subset $A_k \subseteq \Omega$. The objective function is:

$$J_{ECM}(M, V) = \sum_{i=1}^{n} \sum_{A_k \ne \emptyset} |A_k|^\alpha\, m_{ik}^\beta\, d_{ik}^2 \;+\; \sum_{i=1}^{n} \rho^2\, m_{i\emptyset}^\beta, \qquad (3)$$

subject to

$$\sum_{A_k \subseteq \Omega,\, A_k \ne \emptyset} m_{ik} + m_{i\emptyset} = 1 \quad \text{and} \quad m_{ik} \ge 0, \quad \forall i \in \{1, \ldots, n\}, \qquad (4)$$

where $|A_k|$ corresponds to the cardinality of the subset $A_k$, $V$ is the set of prototypes and $d_{ik}^2$ represents the squared Euclidean distance between $x_i$ and the centroid $v_k$. Outliers are handled with the masses $m_{i\emptyset}$, $i = 1, \ldots, n$, allocated to the empty set, together with the parameter $\rho^2 > 0$. The two parameters $\alpha \ge 0$ and $\beta > 1$ are introduced to penalize the degree of belief assigned to subsets with high cardinality and to control the fuzziness of the partition.
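The objective function (3) can be evaluated directly for a toy credal partition (an illustrative sketch of ours, with made-up masses and centroids, one-dimensional data and a Euclidean distance):

```python
def ecm_objective(X, masses, centroids, rho2, alpha=1.0, beta=2.0):
    """J_ECM of Eq. (3) for 1-D data. `masses[i]` maps each focal set
    (a frozenset; frozenset() is the empty set) to m_ik; `centroids`
    maps each non-empty subset A_k to its prototype v_k."""
    J = 0.0
    for xi, mi in zip(X, masses):
        for Ak, mik in mi.items():
            if not Ak:                              # empty set: outlier term
                J += rho2 * mik ** beta
            else:                                   # |A_k|^alpha * m^beta * d^2
                J += len(Ak) ** alpha * mik ** beta * (xi - centroids[Ak]) ** 2
    return J

# Made-up toy example: Omega = {w1, w2}, two 1-D objects.
cent = {frozenset({"w1"}): 0.0,
        frozenset({"w2"}): 4.0,
        frozenset({"w1", "w2"}): 2.0}   # subset centroid = mean of singletons
X = [0.1, 3.9]
masses = [{frozenset({"w1"}): 0.8, frozenset({"w1", "w2"}): 0.1,
           frozenset({"w2"}): 0.05, frozenset(): 0.05},
          {frozenset({"w2"}): 0.9, frozenset({"w1", "w2"}): 0.05,
           frozenset({"w1"}): 0.0, frozenset(): 0.05}]
print(round(ecm_objective(X, masses, cent, rho2=1.0), 6))   # ≈ 0.147775
```

Masses concentrated on the correct singletons yield a small $J_{ECM}$; shifting mass to the pair $\{w_1, w_2\}$ or to the empty set increases the cardinality and outlier penalties, which is exactly what the alternating minimization of ECM trades off.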
An extension of the ECM algorithm has been proposed in order to deal
with a Mahalanobis distance [4]. Such a metric is adaptive and handles various
ellipsoidal cluster shapes, giving the algorithm more flexibility to recover the
inherent structure of the data. The Mahalanobis distance d_ik² between a point
xi and a subset Ak is defined as follows:

    d_ik² = ‖xi − vk‖²_{Sk} = (xi − vk)^T Sk (xi − vk),   (5)

where Sk represents the evidential covariance matrix associated with subset Ak,
calculated as the average of the covariance matrices of the singletons included
in subset Ak. Finally, objective function (3) has to be minimized with respect
to the credal partition matrix M, the centroid matrix V, and S = {S1, . . . , Sc},
the set of covariance matrices dedicated to the clusters.
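To make Eqs. (3)–(5) concrete, here is a minimal Python sketch (not from the paper) that evaluates J_ECM for a hypothetical two-object, two-cluster credal partition; plain lists stand in for matrices, and the identity metric makes d_ik² Euclidean:

```python
def maha2(x, v, S):
    """Eq. (5): squared Mahalanobis distance (x - v)^T S (x - v) in 2-D."""
    dx, dy = x[0] - v[0], x[1] - v[1]
    return dx * (S[0][0] * dx + S[0][1] * dy) + dy * (S[1][0] * dx + S[1][1] * dy)

def j_ecm(X, subsets, V, S, masses, m_empty, alpha=1.0, beta=2.0, rho2=100.0):
    """Eq. (3): sum_i sum_{Ak != {}} |Ak|^alpha m_ik^beta d_ik^2
              + sum_i rho^2 m_i{}^beta."""
    J = 0.0
    for i, x in enumerate(X):
        for k, Ak in enumerate(subsets):
            J += len(Ak) ** alpha * masses[i][k] ** beta * maha2(x, V[k], S[k])
        J += rho2 * m_empty[i] ** beta
    return J

# Hypothetical data: Omega = {w1, w2}; non-empty subsets in a fixed order.
subsets = [{"w1"}, {"w2"}, {"w1", "w2"}]
X = [(0.0, 0.0), (4.0, 0.0)]
V = [(0.0, 0.0), (4.0, 0.0), (2.0, 0.0)]       # one centroid per subset
I2 = [[1.0, 0.0], [0.0, 1.0]]                  # identity metric = Euclidean
S = [I2, I2, I2]
masses = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]    # masses on non-empty subsets
print(j_ecm(X, subsets, V, S, masses, m_empty=[0.0, 0.0]))
```

The value stays small here because each object sits exactly on its most plausible centroid.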

2.3 Evidential Constrained C-Means Algorithm

Several evidential C-Means based algorithms have already been proposed
[1–4, 8, 13] to deal with background knowledge. In each of them, constraints are
The LPECM Semi-clustering Algorithm 69

expressed in the belief function framework, and a term penalizing constraint
violations is incorporated into the objective function of the ECM algorithm.
In [2,3], labeled data constraints are introduced in the algorithms, i.e., the
expert can express uncertainty about the label of an object by assigning it to
a subset. The objective functions of these algorithms are written in such a way
that any mass function which partially or fully respects a constraint on a specific
subset gives a high weighted plausibility to a singleton included in the subset:
    T_ij = T_i(Aj) = Σ_{Al ⊆ Ω, Aj ∩ Al ≠ ∅} (|Aj ∩ Al|^r / |Al|^r) m_il²,  ∀i ∈ {1, . . . , n},   (6)

where r ≥ 0 is a fixed parameter. Notice that if r = 0, then |Aj ∩ Al|^r / |Al|^r = 1,
which implies that T_ij is identical to the plausibility pl_ij.
In [4], the authors assume that pairwise constraints (i.e., must-link and cannot-
link constraints) are available. A plausibility that two objects belong, or do not
belong, to the same class is then defined. This plausibility allows us to add a penalty
term having high values when there exists a high plausibility that two objects are
not (respectively are) in the same cluster although they have a must-link constraint
(respectively a cannot-link constraint):

    pl_{i×j}(θ) = Σ_{{Ak × Al ⊆ Ω² | (Ak × Al) ∩ θ ≠ ∅}} m_{i×j}(Ak × Al)
                = Σ_{Ak ∩ Al ≠ ∅} m_i(Ak) m_j(Al),   (7)

    pl_{i×j}(θ̄) = 1 − m_{i×j}(∅) − bel_{i×j}(θ)
                = 1 − m_{i×j}(∅) − Σ_{k=1}^{c} m_i({ωk}) m_j({ωk}),   (8)

where θ denotes the event that objects xi and xj belong to the same class,
which corresponds to the subset {(ω1, ω1), (ω2, ω2), . . . , (ωc, ωc)} of Ω², whereas
θ̄ denotes the event that objects xi and xj do not belong to the same class,
which corresponds to its complement.
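As a sketch of Eqs. (7) and (8), the two pairwise plausibilities can be computed directly from the mass functions of the two objects; the masses below are hypothetical, over a two-cluster frame Ω = {1, 2}:

```python
def pl_same(mi, mj):
    """Eq. (7): plausibility that the two objects share a class, i.e. the
    total product mass of pairs of intersecting focal sets."""
    return sum(vi * vj
               for A, vi in mi.items() for B, vj in mj.items() if A & B)

def pl_diff(mi, mj, omega):
    """Eq. (8): pl(theta_bar) = 1 - m_ixj(empty) - bel(theta), where
    m_ixj(empty) = mi(empty) + mj(empty) - mi(empty) * mj(empty)."""
    mi0 = mi.get(frozenset(), 0.0)
    mj0 = mj.get(frozenset(), 0.0)
    bel = sum(mi.get(frozenset({w}), 0.0) * mj.get(frozenset({w}), 0.0)
              for w in omega)
    return 1.0 - (mi0 + mj0 - mi0 * mj0) - bel

mi = {frozenset({1}): 0.7, frozenset({1, 2}): 0.3}
mj = {frozenset({2}): 0.6, frozenset({1, 2}): 0.4}
print(pl_same(mi, mj))            # 0.7*0.4 + 0.3*0.6 + 0.3*0.4 = 0.58
print(pl_diff(mi, mj, {1, 2}))    # no singleton is shared with certainty
```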

3 The LPECM Algorithm with Instance-Level Constraints
3.1 Objective Function
We propose a new algorithm called Labeled and Pairwise constraints Evidential
C-Means (LPECM). It is based on the ECM algorithm [14], handles the Mahalanobis
distance, and combines the advantages of pairwise constraints and labeled data
constraints by adding three penalty terms:

    J_LPECM(M, V, S) = ξ J_ECM(M, V, S) + γ J_M(M) + η J_C(M) + δ J_L(M),   (9)

subject to constraints (4). The formulation of J_ECM corresponds to Eqs. (3)
and (5); J_M is a penalty term used for must-link constraints, J_C is dedicated
to cannot-link constraints, and J_L handles labeled data constraints. The
coefficients ξ, γ, η and δ allow us to give more importance to the structure of
the data, the must-link constraints, the cannot-link constraints, or the labeled
data constraints, respectively.
Penalty terms for pairwise constraints and labeled data constraints are
defined similarly to [2,4]:
    J_M(M) = Σ_{(xi,xj) ∈ M} ( 1 − (m_i∅ + m_j∅ − m_i∅ m_j∅) − Σ_{Ak ⊆ Ω, |Ak| = 1} m_ik m_jk ),   (10)

    J_C(M) = Σ_{(xi,xj) ∈ C} Σ_{Ak ∩ Al ≠ ∅} m_ik m_jl,   (11)

    J_L(M) = Σ_{i=1}^{n} Σ_{Ak ⊆ Ω, Ak ≠ ∅} b_ik ( 1 − Σ_{Ak ∩ Al ≠ ∅} (|Ak ∩ Al|^r / |Al|^r) m_il² ),   (12)

where b_ik denotes whether the ith instance is constrained to the subset Ak or not:

    b_ik = { 1 if xi is constrained to subset Ak,
             0 otherwise.   (13)
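A direct (unoptimized) Python sketch of the three penalty terms (10)–(12); mass functions are dictionaries from focal sets to masses, and the two-object example at the bottom is hypothetical:

```python
def j_must(masses, ML):
    """Eq. (10): for each must-link pair, the plausibility that the two
    objects do NOT share a class (cf. Eq. (8))."""
    e = frozenset()
    J = 0.0
    for i, j in ML:
        mi, mj = masses[i], masses[j]
        m_empty = mi.get(e, 0.0) + mj.get(e, 0.0) - mi.get(e, 0.0) * mj.get(e, 0.0)
        singles = sum(v * mj.get(A, 0.0) for A, v in mi.items() if len(A) == 1)
        J += 1.0 - m_empty - singles
    return J

def j_cannot(masses, CL):
    """Eq. (11): for each cannot-link pair, the plausibility that the two
    objects DO share a class."""
    return sum(vi * vj
               for i, j in CL
               for A, vi in masses[i].items()
               for B, vj in masses[j].items() if A & B)

def j_label(masses, L, r=1.0):
    """Eq. (12): for each labeled object (i, Ak), one minus the weighted
    plausibility (Eq. (6)) of its label subset Ak."""
    J = 0.0
    for i, Ak in L:
        T = sum((len(Ak & Al) ** r / len(Al) ** r) * v ** 2
                for Al, v in masses[i].items() if Ak & Al)
        J += 1.0 - T
    return J

masses = [{frozenset({1}): 1.0}, {frozenset({2}): 1.0}]  # two certain objects
print(j_must(masses, [(0, 1)]))                # must-link fully violated (high)
print(j_cannot(masses, [(0, 1)]))              # cannot-link satisfied (zero)
print(j_label(masses, [(0, frozenset({1}))]))  # label respected (zero)
```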

It should be emphasized that in this study, unlike [2], each labeled object
is constrained to only one subset. Indeed, this makes the set of constraints
retrieved from the background knowledge more coherent. Constraints are gathered
in three different sets: M corresponds to the set of must-link constraints,
C to the set of cannot-link constraints, and L denotes the set of labeled data
constraints. The J_M function returns the sum of the plausibilities that must-link
constrained objects are not in the same class. Similarly, J_C returns the sum of
the plausibilities that cannot-link constrained objects are in the same class.
The J_L term computes, for each labeled object, one minus a weighted plausibility
of belonging to its label.

3.2 Optimization

The objective function is minimized as in the ECM algorithm, i.e., by carrying out
an iterative scheme: first, V and S are fixed to optimize M; second, M and S are
fixed to optimize V; finally, M and V are fixed to optimize S.

Centroids Optimization. It can be observed from (9) that the three penalty
terms included in the objective function of the LPECM algorithm do not depend
on the cluster centroids. Hence, the update scheme of V is identical to the ECM
algorithm [14].

Masses Optimization. In order to obtain a quadratic objective function with
linear constraints, we set the parameter β = 2. A classical optimization approach
can then be used to solve the problem [7]. The following equations present how to
transform the objective function (9) into a format accepted by most usual
quadratic optimization solvers.
Let us define m_i^T = (m_i∅, m_iω1, . . . , m_iΩ), the vector of masses for object xi.
The first term of J_LPECM is then:

    J_ECM(M) = Σ_{i=1}^{n} m_i^T Φ_i m_i,   (14)

where Φ_i = [φ_ikl] is a diagonal matrix of size (2^c × 2^c) associated with object xi
and defined such that:

    φ_ikl = { ρ²             if Ak = Al and Ak = ∅,
              |Ak|^α d_ik²   if Ak = Al and Ak ≠ ∅,
              0              otherwise.   (15)

The penalty term used for must-link constraints can be rewritten as follows:

    J_M(M) = n_M + Σ_{(xi,xj) ∈ M} (F_M^T m_i + F_M^T m_j) + Σ_{(xi,xj) ∈ M} m_i^T Δ_M m_j,   (16)

where n_M denotes the number of must-link constraints, F_M is a vector of size
2^c and Δ_M = [δ_kl^M] corresponds to a (2^c × 2^c) matrix such that:

    F_M^T = [−1, 0, . . . , 0]  (of size 2^c),

    δ_kl^M = {  1  if Ak = Al = ∅,
               −1  if Ak = Al and |Ak| = |Al| = 1,
                0  otherwise.   (17)
The penalty term associated with cannot-link constraints is:

    J_C(M) = Σ_{(xi,xj) ∈ C} m_i^T Δ_C m_j,   (18)

where Δ_C = [δ_kl^C] is a (2^c × 2^c) matrix such that:

    δ_kl^C = { 1 if Ak ∩ Al ≠ ∅,
               0 otherwise.   (19)

Finally, the penalty term for the labeled data constraints is denoted as
follows:

    J_L(M) = n_L − Σ_{i=1}^{n} F_L^T m_i,   (20)

where n_L denotes the number of labeled data constraints and F_L is a vector of
size 2^c such that:

    (F_L)_l = Σ_{Ak ⊆ Ω} v_ikl c_lk,  ∀Al ⊆ Ω,   (21)

    c_lk = |Ak ∩ Al|^r / |Al|^r,   (22)

    v_ikl = { 1 if (xi, Ak) ∈ L and Ak ∩ Al ≠ ∅,
              0 otherwise.   (23)

where the expression (xi, Ak) ∈ L means that the labeled data constraint on
object xi is the subset Ak. The indicator v_ikl ∈ {0, 1} equals 1 for subsets Al
that have a non-empty intersection with Ak, given the constraint xi ∈ Ak.
Now, let us define m^T = (m_1^T, . . . , m_n^T), the vector of size n2^c containing the
masses for each object and each subset, H a matrix of size (n2^c × n2^c) and F a
vector of size n2^c such that:

    H = [ Φ_1   Δ_12   ⋯   Δ_1n
          Δ_21  Φ_2    ⋯    ⋮
           ⋮     ⋮     ⋱    ⋮
          Δ_n1   ⋯     ⋯   Φ_n ] ,  where Δ_ij = { Δ_M  if (xi, xj) ∈ M,
                                                   Δ_C  else if (xi, xj) ∈ C,
                                                   0    otherwise,   (24)

    F^T = [F_1 ⋯ F_i ⋯ F_n],  where F_i = t_i F_M − b_i F_L,   (25)

    t_i = { 1 if xi ∈ M,          b_i = { 1 if xi ∈ L,
            0 otherwise,                  0 otherwise.   (26)
Finally, the objective function (9) can be rewritten as follows:

    J_LPECM(M) = m^T H m + F^T m.   (27)

3.3 Metric Optimization


As can be observed from (9), the three penalty terms of the LPECM objective
function do not depend on the Mahalanobis distance. Since the set of metrics S
only appears in J_ECM, the update method is identical to that of the algorithm
in [4]. The overall procedure of the LPECM algorithm is summarized in
Algorithm 1.

4 Experiments
4.1 Experimental Protocols
Performances and time consumption of the LPECM algorithm have been tested
on a toy data set and several classical data sets from the UCI Machine Learning
Repository [9]. For the Letters data set, we kept only the three letters {I, J, L},
as done in [6]. As in [14], the fixed parameters associated with the ECM algorithm
were set to α = 1, β = 2 and ρ² = 100. In order to balance the importance of the
data structure, the must-link constraints, the cannot-link constraints and the
labeled data constraints, we respectively set ξ = 1/(n2^c), γ = 1/|M|, η = 1/|C|
and δ = 1/|L| as coefficients.

Algorithm 1. The LPECM algorithm with an adaptive metric
Require: c: number of desired clusters; X = (x1, . . . , xn): the data set; C: set of
cannot-link constraints; M: set of must-link constraints; L: set of labeled data
constraints
Ensure: credal partition matrix M, centroid matrix V, distance metric matrix S
1: Initialization of V;
2: repeat
3:   Calculate the new credal partition matrix M by solving the quadratic
     programming problem defined by (27) subject to (4);
4:   Calculate the new centroid matrix V by solving the linear system defined as
     in the ECM algorithm [14];
5:   Calculate the new metric matrices S and the new associated distances using [4];
6: until No significant change in V between two successive iterations;
Each experiment on a data set consists of 20 simulations with a random selection
of the constraints. For each simulation, five runs of the LPECM algorithm with
random initialization of the centroids are performed. Then, in order to avoid
local optima, the clustering solution with the minimum value of the objective
function is selected.
The accuracy of the obtained credal partition is measured with the Adjusted
Rand Index (ARI) [12], which is the corrected-for-chance version of the Rand
Index that compares a hard partition with the true partition of a data set.
As a consequence, the credal partition generated by the LPECM algorithm is
first transformed into a fuzzy partition using the pignistic transformation and
then the maximum of probability on each object is retrieved to obtain a hard
partition.
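For reference, the ARI can be computed from the pair-counting contingency table of the two hard partitions with the standard library only; this is a sketch consistent with [12], not code from the paper:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (Index - Expected) / (MaxIndex - Expected), computed from the
    contingency table of the two partitions."""
    n = len(labels_true)
    nij = Counter(zip(labels_true, labels_pred))   # contingency cells
    a = Counter(labels_true)                       # row sums
    b = Counter(labels_pred)                       # column sums
    index = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

# Identical partitions up to cluster renaming give an ARI of 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```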

4.2 Toy Data Set


In order to show the interest of the LPECM algorithm, we started our experi-
ments with a tiny synthetic data set composed of 15 objects and three classes.
Figure 1 presents the hard credal partition obtained using the ECM algorithm.
Big cross marks denote the centroid of each cluster. Centroids for subsets with
higher cardinalities are not represented, to ease reading. As can be observed,
objects located between two clusters are assigned to subsets with cardinality
equal to two. Notice also that, due to the stochastic initialization of the cen-
troids, there may exist a small difference between the results obtained from
every execution of the ECM algorithm. After the addition of background knowl-
edge in the form of must-link constraints, cannot-link constraints and labeled
data constraints and the execution of the LPECM algorithm with a Euclidean
distance, it is interesting to observe that previous uncertainties have vanished.

Fig. 1. Hard credal partition obtained on the Toy data set with the ECM algorithm.
Fig. 2. Hard credal partition obtained on the Toy data set with the LPECM algorithm.

Figure 2 presents the hard credal partition obtained. The magenta dashed lines
describe cannot-link constraints, the light green solid lines represent must-link
constraints, and the circled point corresponds to the labeled data constraint.
Figure 3 illustrates, for this execution of the LPECM algorithm, the mass
distribution over the singletons with respect to the point numbers, giving a
clearer view of the mass allocations. Table 1 displays the accuracy as well as
the time consumption for the ECM algorithm and for the LPECM algorithm when,
first, only the cannot-link constraints are incorporated, second, when the
cannot-link and the must-link constraints are introduced (line LPECM-C-Must
in Table 1), and finally when all constraints are added (line LPECM-C-M-labeled
in Table 1). Our results demonstrate that the combination of pairwise constraints
and labeled data constraints improves the performance of the semi-clustering
algorithm at a tolerable cost in computation time. As expected, the more
constraints are added, the better the performance.

4.3 Real Data Sets

The LPECM algorithm has been tested on three known data sets from the
UCI Machine Learning Repository, namely Iris, Glass, and Wdbc, as well as the
Letters data set derived from UCI. Table 2 indicates, for each data set, its
number of objects, its number of attributes and its number of classes.
For each data set, we randomly created 5%, 8%, and 10% of each type of
constraints out of the whole objects, leading to a total of 15%, 24%, and 30%
of constraints. As an example, Fig. 4 shows the hard credal partition obtained
with the Iris data set after executing the LPECM algorithm with a Mahalanobis
distance and 24% of constraints in total. As can be observed, all the constrained
objects are clustered with certainty in a singleton. Ellipses represent the covari-
ance matrices obtained for each cluster.

Table 1. Performance obtained on the toy data set with the LPECM algorithm

                        ARI   Time (s)
    ECM                 0.60  0.07
    LPECM-Cannot        0.68  0.41
    LPECM-C-Must        0.85  0.29
    LPECM-C-M-labeled   1.00  0.22

Fig. 3. Mass curve obtained on the Toy data set with the LPECM algorithm

Tables 3 and 4 illustrate, for all data sets, the accuracy results with a Euclidean
and a Mahalanobis distance, respectively, when the different percentages of
constraints are employed. Means and standard deviations are calculated over 20
simulations. As can be observed, incorporating constraints leads most of the time
to a significant improvement of the clustering solution. Using a Mahalanobis
distance particularly helps to achieve better accuracy than using a Euclidean
distance. Indeed, the Mahalanobis distance corresponds to an adaptive metric
giving more freedom than a Euclidean distance to respect the constraints while
finding a coherent data structure.

Table 2. Description of the data sets from the UCI Machine Learning Repository

    Name     Objects  Attributes  Clusters
    Iris     150      4           3
    Letters  227      16          3
    Wdbc     569      31          2
    Glass    214      10          3

Fig. 4. Hard credal partition obtained on the Iris data set with the LPECM algorithm

Regarding time consumption, as can be observed from Fig. 5: (1) adding
constraints yields higher computation times than using no constraints; (2) most
of the time, the more constraints are added, the less time is needed to complete
the computation.

Table 3. LPECM's performance (ARI) with the Euclidean distance

              ECM          LPECM 5%     LPECM 8%     LPECM 10%
    Iris      0.59 ± 0.00  0.70 ± 0.01  0.71 ± 0.00  0.70 ± 0.01
    Letters   0.04 ± 0.01  0.09 ± 0.03  0.09 ± 0.04  0.10 ± 0.02
    Wdbc      0.67 ± 0.00  0.71 ± 0.00  0.71 ± 0.01  0.71 ± 0.00
    Glass     0.59 ± 0.07  0.60 ± 0.07  0.62 ± 0.06  0.65 ± 0.08

Table 4. LPECM's performance (ARI) with the Mahalanobis distance

              ECM          LPECM 5%     LPECM 8%     LPECM 10%
    Iris      0.67 ± 0.01  0.71 ± 0.05  0.82 ± 0.01  0.83 ± 0.04
    Letters   0.08 ± 0.01  0.45 ± 0.03  0.47 ± 0.02  0.60 ± 0.05
    Wdbc      0.73 ± 0.02  0.74 ± 0.03  0.75 ± 0.02  0.77 ± 0.05
    Glass     0.56 ± 0.03  0.60 ± 0.03  0.65 ± 0.02  0.65 ± 0.03

Fig. 5. Time consumption (CPU) of the LPECM algorithm with Euclidean distance

5 Conclusion

In this paper, we introduced a new algorithm named Labeled and Pairwise
constraints Evidential C-Means (LPECM). It generates a credal partition and
mixes three main types of instance-level constraints together, allowing us to
retrieve more constraints from the background knowledge than other
semi-supervised clustering algorithms. In addition, the belief function framework
employed in our algorithm allows us (1) to represent doubts about the labeled
data constraints and (2) to clearly express, through the resulting credal partition,

the uncertainties about the class memberships of the objects. Experiments show
that the LPECM algorithm does obtain better accuracy with the introduction
of constraints, particularly with a Mahalanobis distance. Further investigations
have to be performed to fine-tune parameters and to study the influence of
the constraints on the clustering solution. The LPECM algorithm can also be
applied to a real application to show the interest of gathering various types
of constraints. In this framework, active learning schemes, which automatically
retrieve a few informative constraints with the help of an expert, are interesting
to study. Finally, in order to make the LPECM algorithm faster and more scalable,
a new minimization process can be developed by relaxing some optimization
constraints.

References
1. Antoine, V., Quost, B., Masson, M.H., Denœux, T.: CEVCLUS: evidential clus-
tering with instance-level constraints for relational data. Soft Comput. - Fusion
Found. Methodol. Appl. 18(7), 1321–1335 (2014)
2. Antoine, V., Gravouil, K., Labroche, N.: On evidential clustering with partial
supervision. In: Destercke, S., Denoeux, T., Cuzzolin, F., Martin, A. (eds.) BELIEF
2018. LNCS (LNAI), vol. 11069, pp. 14–21. Springer, Cham (2018). https://doi.
org/10.1007/978-3-319-99383-6 3
3. Antoine, V., Labroche, N., Vu, V.V.: Evidential seed-based semi-supervised clus-
tering. In: International Symposium on Soft Computing & Intelligent Systems,
Kitakyushu, Japan (2014)
4. Antoine, V., Quost, B., Masson, M.H., Denœux, T.: CECM: constrained evidential
C-means algorithm. Comput. Stat. Data Anal. 56(4), 894–914 (2012)
5. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy C-means clustering algorithm.
Comput. Geosci. 10(2–3), 191–203 (1984)
6. Bilenko, M., Basu, S., Mooney, R.: Integrating constraints and metric learning
in semi-supervised clustering. In: Proceedings of the Twenty-First International
Conference on Machine Learning. ACM New York, NY, USA (2004)
7. Coleman, T.F., Li, Y.: A reflective newton method for minimizing a quadratic
function subject to bounds on some of the variables. SIAM J. Optim. 6(4), 1040–
1058 (1996)
8. Denœux, T.: Evidential clustering of large dissimilarity data. Knowl.-Based Syst.
106(C), 179–195 (2016)
9. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.
edu/ml
10. Grira, N., Crucianu, M., Boujemaa, N.: Active semi-supervised fuzzy clustering.
Pattern Recogn. 41(5), 1834–1844 (2008)
11. Gustafson, D.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix.
In: IEEE Conference on Decision & Control Including the Symposium on Adaptive
Processes, New Orleans, LA (2007)
12. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
13. Li, F., Li, S., Denoeux, T.: k-CEVCLUS: constrained evidential clustering of large
dissimilarity data. Knowl.-Based Syst. 142, 29–44 (2018)
14. Masson, M.H., Denœux, T.: ECM: an evidential version of the fuzzy C-means
algorithm. Pattern Recogn. 41(4), 1384–1397 (2008)
15. Shafer, G.: A Mathematical Theory of Evidence, vol. 42. Princeton University
Press, Princeton (1976)

16. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66, 191–234
(1994)
17. Vu, V.V., Do, H.Q., Dang, V.T., Do, N.T.: An efficient density-based clustering
with side information and active learning: a case study for facial expression recog-
nition task. Intell. Data Anal. 23(1), 227–240 (2019)
18. Wagstaff, K., Cardie, C., Rogers, S., Schrœdl, S.: Constrained k-means cluster-
ing with background knowledge. In: Proceedings of the Eighteenth International
Conference on Machine Learning (ICML), Williamstown, MA, USA, vol. 1, pp.
577–584 (2001)
19. Zhang, H., Lu, J.: Semi-supervised fuzzy clustering: a kernel-based approach.
Knowl.-Based Syst. 22(6), 477–481 (2009)
Hybrid Reasoning on a Bipolar
Argumentation Framework

Tatsuki Kawasaki, Sosuke Moriguchi, and Kazuko Takahashi(B)

Kwansei Gakuin University, 2-1 Gakuen, Sanda, Hyogo 669-1337, Japan


{dxk96093,ktaka}@kwansei.ac.jp, [email protected]

Abstract. We develop a method of reasoning using an incrementally


constructed bipolar argumentation framework (BAF) aiming to apply
computational argumentation to legal reasoning. A BAF that explains
the judgment of a certain case is constructed based on the user's knowledge
and recognition. More specifically, a set of effective laws is derived
as the conclusions from evidential facts recognized by the user, in a
bottom-up manner; conversely, the evidences required to derive a new
conclusion are identified if certain conditions are added, in a top-down
manner. The BAF is incrementally constructed by repeated exercise of
this bidirectional reasoning. The method provides support for those who
are not familiar with the law, so that they can understand the judgment
process and identify strategies that might allow them to win their case.

Keywords: Argumentation · Bidirectional reasoning · Legal reasoning

1 Introduction
An argumentation framework (AF) is a powerful tool in the context of incon-
sistent knowledge [15,21]. There are several possible application areas of AFs,
including law [4,20]. To date, research on applications has focused principally
on AF updating to yield an acceptable set of facts when a new argument is
presented, and strategies to win the argumentation when all of the dialog paths
are known. However, in real legal cases, an AF representing a law in its entirety
is usually incompletely grasped at the initial stage. Thus, it is more realistic to
construct the AF incrementally; recognized facts are added in combination with
AF reasoning.
For example, consider a case in which a person leased her house to another
person, and the lessee then sub-leased a room to his sister; the lessor now wants
to cancel the contract. (This is a simplified version of the case discussed in
Satoh et al. [23].) The lessor decides to prosecute the lessee. The lessor knows
that there was a lease, that they handed over the house to the lessee, and that
the room was handed over by the lessee to the sublessee. However, if the lessor
is not familiar with the law, she does not know what law might be applicable to
her circumstances or what additional facts should be proven to make it effective.
In addition, laws commonly include exceptions; that is, a law is effective if certain
conditions are satisfied provided there is no exception.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 79–92, 2019.
https://doi.org/10.1007/978-3-030-35514-2_7
80 T. Kawasaki et al.

For example, if there is no abuse of confidence, then the law of cancellation is
not effective. Therefore, the lessor should check that there is "no abuse of
confidence," in addition to the facts that prove what must be proven. Moreover,
other facts may be needed to prove that there has been no abuse of confidence.
Also, the presence of an exception may render another law effective. For those
lacking a legal background, it can be difficult to grasp the entire structure of
a particular law, which may be extensive and complicated. Thus, s/he often
consults with, or even fully delegates the problem-solving process to, a lawyer.
However, if the argumentation structure of the law was clear, s/he would be
more likely to adequately understand the judgment process, obviating the need
for a lawyer.
In this paper, we develop a bidirectional reasoning method using a bipolar
argumentation framework (BAF) [2] that is applicable to legal reasoning. In a
BAF, a general rule is represented as a support relation, and an exception as an
attack relation. The facts of a case become arguments that are not attacked or
supported by other arguments.
We explore the BAF in both a bottom-up and top-down manner, search for
effective laws based on proven facts, and identify the facts required for applica-
tion to other laws.
Beginning with the user-recognized facts of a specific case, laws that may
be effective are searched for using a bottom-up process. Next, new conclusions
are considered if specific conditions are satisfied. If such conclusions exist, the
required facts are then identified in a top-down manner, so that the conditions
are satisfied. If the existence of such facts can be proven, the facts are added as
evidence, and the next round then begins. The procedure terminates if the user is
satisfied with the conclusions obtained, or if no new conclusions are derived. By
repeating this process, a user can simulate and scrutinize the judgment process
to identify a strategy that may allow them to win the case.
This paper is organized as follows. In Sect. 2, we present the BAF, and the
semantics thereof. In Sect. 3, we describe how the law is interpreted and rep-
resented using a BAF. In Sect. 4, we show the reasoning process of a BAF. In
Sect. 5, we discuss related works. Finally, in Sect. 6, we present our conclusions
and describe our planned future work.

2 Bipolar Argumentation Framework

A BAF is an extension of an AF in which the two relations of attack and support


are defined over a set of arguments [2]. We define a support relation between a
power set of arguments and a set of arguments; this differs from the common
support relation of a BAF, so that it corresponds to a legal structure.

Definition 1 (bipolar argumentation framework). A BAF is defined as a
triple ⟨AR, ATT, SUP⟩, where AR is a finite set of arguments, ATT ⊆ AR × AR
and SUP ⊆ (2^AR \ {∅}) × AR. If (B, A) ∈ ATT, then B attacks A; if (𝒜, A) ∈
SUP, then 𝒜 supports A.
Hybrid Reasoning on a Bipolar Argumentation Framework 81

A BAF can be regarded as a directed graph where the nodes and edges
correspond to the arguments and the relations, respectively. Below, we represent
a BAF graphically; a simple solid arrow indicates a support relation, and a
straight arrow with a cutting edge indicates an attack relation. The dashed
rectangle shows a set of arguments supporting a certain argument; it is sometimes
omitted if the supporting set is a singleton.

Example 1. Figure 1 is a graphical representation of the BAF

    ⟨{a, b, c, d, e}, {(b, a), (e, d)}, {({c, d}, a)}⟩.

Fig. 1. Example of BAF.

Definition 2 (leaf ). An argument that is neither attacked nor supported by


any other argument in a BAF is said to be a leaf of the BAF.

For a BAF ⟨AR, ATT, SUP⟩, let → be a binary relation over AR defined as follows:

    → = ATT ∪ {(A, B) | ∃𝒜 ⊆ AR, A ∈ 𝒜 ∧ (𝒜, B) ∈ SUP}.

Definition 3 (acyclic). A BAF ⟨AR, ATT, SUP⟩ is said to be acyclic if there
is no A ∈ AR such that (A, A) ∈ →⁺, where →⁺ is the transitive closure of →.

We define semantics for the BAF based on labeling [9]. Usually, labeling is
a function from a set of arguments to {in, out, undec}, but undec is unneces-
sary here because we consider only acyclic BAFs. An argument labeled in is
considered an acceptable argument.

Definition 4 (labeling). For a BAF ⟨AR, ATT, SUP⟩, a labeling L is a function
from AR to {in, out}.

Labeling is extended to a set of arguments 𝒜 as follows: L(𝒜) = in if L(A) = in
for all A ∈ 𝒜, and L(𝒜) = out otherwise.

Definition 5 (complete labeling). For a BAF baf = ⟨AR, ATT, SUP⟩, a labeling
L is complete iff the following conditions are satisfied for any argument
A ∈ AR: (i) L(A) = in if A is a leaf, or if (∀B ∈ AR: (B, A) ∈ ATT ⇒ L(B) =
out) ∧ (∃𝒜 ∈ 2^AR: (𝒜, A) ∈ SUP ∧ L(𝒜) = in); (ii) L(A) = out otherwise.

If an argument is both attacked and supported, the attack is taken to be


stronger than the support. For any acyclic BAF, there is exactly one complete
labeling.
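Since an acyclic BAF has exactly one complete labeling, it can be computed by a simple recursion over Definition 5; the sketch below (ours, not from the paper) encodes the BAF of Example 1, with attacks as ordered pairs and supports as (set, argument) pairs:

```python
def complete_labeling(AR, ATT, SUP):
    """Definition 5: a leaf is 'in'; any other argument is 'in' iff all of
    its attackers are 'out' and at least one supporting set is wholly 'in'."""
    attackers = {A: [B for (B, X) in ATT if X == A] for A in AR}
    supporters = {A: [S for (S, X) in SUP if X == A] for A in AR}
    L = {}

    def label(A):                  # terminates because the BAF is acyclic
        if A not in L:
            if not attackers[A] and not supporters[A]:          # leaf
                L[A] = "in"
            elif (all(label(B) == "out" for B in attackers[A])
                  and any(all(label(B) == "in" for B in S)
                          for S in supporters[A])):
                L[A] = "in"
            else:
                L[A] = "out"
        return L[A]

    for A in AR:
        label(A)
    return L

# The BAF of Example 1: b attacks a, e attacks d, {c, d} supports a.
AR = {"a", "b", "c", "d", "e"}
ATT = {("b", "a"), ("e", "d")}
SUP = [({"c", "d"}, "a")]
print(complete_labeling(AR, ATT, SUP))  # leaves b, c, e are in; a, d are out
```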

3 Description of Legal Knowledge in a BAF


In this paper, we consider an application of the Japanese civil code.
We assume that the BAFs are acyclic and that each law features both general
rules and exceptions. A law is effective if the conditions of the general rule are
satisfied unless an exception holds. We construct a BAF in which each condition
in a rule is represented by an argument; the general rules can be represented by
support relations, and the exceptions by attack relations. Therefore, our inter-
pretations of attack and support relations differ from those used in the other
BAFs. First, a support relation is defined as a binary relation of a power set and
a set of arguments, since if one of the conditions is not met, the law is ineffec-
tive. Second, an argument lacking support is labeled out, even if it is attacked
by an argument labeled out, since a law is not defined only by its exceptions
and any argument other than a leaf should have an argument that supports it.
The correspondence between the “acceptance” criterion of our BAF and that of
a logic program is shown in [17].
We assume that the entire set of laws can be represented by a BAF termed
a universal BAF, denoted as follows:

    ubaf = ⟨UAR, UATT, USUP⟩.
It is almost impossible for a person who is not an expert to understand all
of the laws. Therefore, we construct a specific BAF for each incident; relevant
evidential facts are disclosed, and applicable laws identified using the universal
BAF.
Definition 6 (existence/absence argument). For an argument A, an argu-
ment showing the existence of an evidential fact for A is termed an existence
argument and is denoted by ex(A); and an argument showing the absence of an
evidential fact for A is termed an absence argument and is denoted by ab(A).
These arguments are abbreviated as ex/ab arguments, respectively.
Definition 7 (consistent ex/ab arguments set). For a set of ex/ab argu-
ments S, if there does not exist an argument A that satisfies both ex(A) ∈ S and
ab(A) ∈ S, then S is said to be consistent.
Example 2. Figure 2 shows a BAF for the house lease case shown in Sect. 1,
together with the relevant ex/ab arguments.
In this Figure, ex(a1), ex(a2), and ex(a4) are existence arguments for agree-
ment of lease contract, handover to lessee, and handover to sublessee, respec-
tively; ab(b1) is an absence argument for fact of non abuse of confidence; and
no evidence is currently shown for the other leaves.

4 Reasoning Using the BAF


4.1 Outline
We employ a running example throughout this section.

Fig. 2. Example of a BAF for a house-lease case.

Example 3. We assume the existence of the universal BAF ubaf = ⟨UAR,
UATT, USUP⟩ shown in Fig. 3.

Fig. 3. Example of a universal BAF ubaf .

Let Ex be a set of ex/ab arguments that is currently recognized by a user.


For each ex(A) or ab(A) in Ex, A ∈ UAR and A is a leaf of ubaf.
Initially, a user recognizes a set of facts related to a certain incident. The
reasoning proceeds by repeating two methods in turn. The first is used to derive
conclusions from the facts in a bottom-up manner; the other is employed to find
the evidence needed to draw a new conclusion if certain other conditions are
met, and proceeds in a top-down manner.

4.2 Bottom-Up Reasoning


In bottom-up reasoning, arguments are derived by following the support relations
from an ex/ab argument. The algorithm is shown in Algorithm 1.

Algorithm 1. BUP: find conclusions
Let Ex be a set of ex/ab arguments and AR = {A | ex(A) ∈ Ex} ∪ {A | ab(A) ∈ Ex}.
Find a pair of a set of arguments 𝒜 ⊆ AR and an argument A ∈ UAR \ AR such
that (𝒜, A) ∈ USUP.
while there exists such a pair (𝒜, A) do
  Set AR = AR ∪ {A}.
end while
Set SUP = USUP ∩ (2^AR × AR) ∪ {({ex(A)}, A) | ex(A) ∈ Ex}.
Set ATT = UATT ∩ (AR × AR) ∪ {(ab(A), A) | ab(A) ∈ Ex}.
Set AR = AR ∪ Ex.
Apply the complete labeling L to baf = ⟨AR, ATT, SUP⟩.
Concl(Ex) = {A | L(A) = in ∧ ¬∃(𝒜, B) ∈ SUP: A ∈ 𝒜 ⊆ AR}.
return Concl(Ex).

The resulting set of conclusions is the set of arguments that are acceptable,
and no more conclusions can be drawn from the currently known facts.
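The BUP procedure can be ported to Python directly. Since Fig. 3 is not reproduced here, the universal BAF below is a small hypothetical one of our own: facts a1 and b1 jointly support a law a, an exception c1 attacks a, and a in turn supports a further conclusion e:

```python
# Hypothetical universal BAF (a stand-in for the ubaf of Fig. 3).
UAR = {"a1", "b1", "c1", "a", "e"}
USUP = [({"a1", "b1"}, "a"), ({"a"}, "e")]
UATT = {("c1", "a")}

def bup(Ex):
    """BUP: Ex is a set of ('ex' | 'ab', leaf) pairs. Close the recognized
    facts under USUP, build baf, label it, and return the accepted
    arguments that support nothing further (the conclusions)."""
    AR = {A for (_, A) in Ex}
    changed = True
    while changed:                         # add arguments supported from AR
        changed = False
        for Sset, A in USUP:
            if Sset <= AR and A not in AR:
                AR.add(A)
                changed = True
    SUP = [(set(Sset), A) for (Sset, A) in USUP if Sset <= AR]
    ATT = {(B, A) for (B, A) in UATT if B in AR and A in AR}
    for kind, A in Ex:                     # ex/ab arguments act on the leaves
        node = kind + "(" + A + ")"
        AR.add(node)
        if kind == "ex":
            SUP.append(({node}, A))
        else:
            ATT.add((node, A))
    L = {}                                 # complete labeling (Definition 5)
    def label(A):
        if A not in L:
            atts = [B for (B, X) in ATT if X == A]
            sups = [Sset for (Sset, X) in SUP if X == A]
            if not atts and not sups:
                L[A] = "in"
            else:
                L[A] = "in" if (all(label(B) == "out" for B in atts)
                                and any(all(label(B) == "in" for B in Sset)
                                        for Sset in sups)) else "out"
        return L[A]
    for A in AR:
        label(A)
    in_support = {B for (Sset, _) in SUP for B in Sset}
    return {A for A in AR if L[A] == "in" and A not in in_support}

print(bup({("ex", "a1"), ("ex", "b1")}))  # both facts proven, no exception
```

With evidence for a1 and b1 only, the law a is accepted and the terminal conclusion is {e}; adding evidence for the exception c1 makes a (and hence e) rejected.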
Example 4 (Cont'd). Let Ex be {ex(a1), ex(b1), ex(c1), ex(d1)}, and ubaf be the
BAF in Fig. 3. Then, the BAF can be constructed using the process shown in
Fig. 4(a) and (b); finally, baf 1 is obtained, and Concl (Ex) = {a, e} is derived
as the set of conclusions. The complete labeling of the BAF baf 1 is shown in
Fig. 4(c).


Fig. 4. The bottom-up reasoning used to construct baf 1 .

4.3 Top-Down Reasoning


On the other hand, we can seek additional facts that must be proven if a new
conclusion is to be derived. Here, we search for a new conclusion and a set of
supports, and identify the facts required to derive the arguments of the set.

Definition 8 (differential support pair, differential supporting set


of arguments, differentially supported argument). For a BAF
baf = ⟨AR, ATT , SUP ⟩, if (A ∩ AR) ≠ ∅ ∧ (A ∩ AR) ≠ A ∧ (A, A) ∈ USUP ,
then (A \ AR, A) is said to be a differential support pair on baf . In addition,
A \ AR and A are said to be a differential supporting set of arguments on baf ,
and a differentially supported argument on baf , respectively.

Intuitively, a differential support pair means that A cannot be derived using


the current BAF due to the lack of required conditions, but it can be derived if
all of the arguments in A \ AR are accepted. In general, there may exist several
differential support pairs on any BAF.
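Definition 8 translates directly into a scan over the universal support relation; the sketch below assumes supports are encoded as (frozenset of premises, target) pairs.

```python
def differential_support_pairs(ar, usup):
    """Definition 8 (sketch): for each universal support (premises, target)
    that is partially, but not fully, covered by the current argument set ar,
    return the pair (missing premises, target)."""
    pairs = []
    for premises, target in usup:
        known = premises & ar
        if known and known != premises:   # (A ∩ AR) ≠ ∅ and (A ∩ AR) ≠ A
            pairs.append((premises - ar, target))
    return pairs
```

With AR = {e} and the universal support ({e, f, g}, l), this returns the single pair ({f, g}, l), matching the situation of Example 5.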

Example 5 (Cont’d). For baf 1 , we find differential support pair ({f, g}, l),
because {e, f, g} ∩ AR = {e} ≠ ∅ and ({e, f, g}, l) ∈ USUP (Fig. 5).

Fig. 5. A differential support pair on baf 1 : ({f, g}, l).

For a BAF baf = ⟨AR, ATT , SUP ⟩ and an argument A ∈ AR, we detect a set
of facts that satisfies L(A) = in. For an argument A, we check the conditions for
labeling of the arguments that attack A and the sets of arguments that support
A. This is achieved by repeatedly applying the following two algorithms: PC (A)
and NC (A), which are shown in Algorithms 2 and 3, respectively. Note that
there is no argument that both lacks support and is attacked.
Then, discovery of the required facts proceeds using the algorithm shown in
Algorithm 4.
As a result, a set of ex/ab arguments is generated. An existence argument
ex(A) shows that the fact is required if L(A) = in is to hold, whereas an absence
argument ab(A) shows that the evidence is an obstacle to prove L(A) = in.

Example 6 (Cont’d). For a differential supporting set of arguments {f, g}, we


find Sol({f, g}) = P C(f ) ∪ P C(g).

(i) P C(f ) = {ex(f )}.


(ii) P C(g) = P C(h) = P C(h1) ∪ N C(j) (Fig. 6). As for P C(h1), we obtain
{ex(h1)}. As for N C(j), we have two alternatives: N C(j1) and P C(k)
(Fig. 7).

Algorithm 2. PC (A): find required arguments for L(A) = in.


Let A be an argument in UAR.
if A is a leaf of ubaf then
Sol (A) = {ex(A)}.
else
Choose an arbitrary set of arguments A that satisfies (A, A) ∈ USUP .
Sol (A) = ⋃(B,A)∈UATT NC (B) ∪ ⋃Ai ∈A PC (Ai ).
end if
return Sol (A).

Algorithm 3. NC (A): find required arguments for L(A) = out.


Let A be an argument in UAR.
if A is a leaf of ubaf then
Sol (A) = {ab(A)}.
else
Choose an arbitrary argument B that satisfies (B, A) ∈ UATT .
Let A1 , . . . , An be all sets of arguments such that (Ai , A) ∈ USUP (i = 1, . . . , n).
Choose an arbitrary argument Ai ∈ Ai (i = 1, . . . , n).
Either Sol (A) = PC (B)
or Sol (A) = ⋃i=1,...,n NC (Ai ).
end if
return Sol (A).
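The mutual recursion of Algorithms 2 and 3 can be sketched as follows. The nondeterministic choices are resolved here by fixed heuristics (PC takes the first support set found; NC prefers accepting an attacker when one exists, otherwise blocks one premise of each support), and the small graph in the demo is hypothetical, chosen so that the result matches the set of required facts of Example 6.

```python
def pc(arg, leaves, usup, uatt):
    """Algorithm 2 (sketch): facts required for L(arg) = in."""
    if arg in leaves:
        return {('ex', arg)}
    premises = next(p for p, t in usup if t == arg)  # arbitrary support set
    sol = set()
    for b, t in uatt:
        if t == arg:                     # every attacker must be labeled out
            sol |= nc(b, leaves, usup, uatt)
    for a in premises:                   # every premise must be labeled in
        sol |= pc(a, leaves, usup, uatt)
    return sol

def nc(arg, leaves, usup, uatt):
    """Algorithm 3 (sketch): facts required for L(arg) = out."""
    if arg in leaves:
        return {('ab', arg)}
    attackers = [b for b, t in uatt if t == arg]
    if attackers:                        # alternative 1: accept an attacker
        return pc(attackers[0], leaves, usup, uatt)
    sol = set()                          # alternative 2: block every support
    for premises, t in usup:
        if t == arg:
            sol |= nc(next(iter(premises)), leaves, usup, uatt)
    return sol

# hypothetical graph: h supported by {h1} and attacked by j; j supported by
# {j1}; g supported by {h}; f, h1, j1 are leaves
leaves = {'f', 'h1', 'j1'}
usup = [(frozenset({'h1'}), 'h'), (frozenset({'j1'}), 'j'),
        (frozenset({'h'}), 'g')]
uatt = [('j', 'h')]
sol = pc('f', leaves, usup, uatt) | pc('g', leaves, usup, uatt)
assert sol == {('ex', 'f'), ('ex', 'h1'), ('ab', 'j1')}
```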

Assume that we choose the condition N C(j1). Then, we find {ab(j1)} as


Sol(j1) (Fig. 8). Finally, we obtain a set of required facts {ex(f ), ex(h1), ab(j1)}
(Fig. 9).

4.4 Hybrid Reasoning

The algorithm used for hybrid reasoning is Algorithm 5.


As a result, the required facts are identified, and conclusions are derived from
these facts.

Example 7 (Cont’d). For a set of required facts {ex(f ), ex(h1), ab(j1)}, assume
that a user has confirmed the existence of f and h1, and the absence of j1.
Then, we construct a new BAF baf 2 in a bottom-up manner from this set. Part
of the labeling of baf 2 is shown in Fig. 10. Finally, we obtain a new conclusion
set Concl = {a, i, l}.

Algorithm 4. TDN: find required facts


Let baf = ⟨AR, ATT , SUP ⟩ be a BAF and A a differential supporting set of arguments on baf .
Sol (A) = ⋃A∈A PC (A).
return Sol (A).

Fig. 6. Top-down reasoning: Both NC (j) and PC (h1) are required.

Fig. 7. Top-down reasoning: Either NC (j1) or PC (k) is required.

Fig. 8. Top-down reasoning: The situation when choosing N C(j1).

The hybrid algorithm is nondeterministic at several steps and there are mul-
tiple possible solutions.

Example 8 (Cont’d). Assume that we choose the condition P C(k) in Fig. 7.


Then, we find {ex(k1)} as Sol(k1), and the set of required facts is
{ex(f ), ex(h1), ex(k1)}. In this case, we construct the different BAF baf 2 shown
in Fig. 11 after a second round of bottom-up reasoning. Strictly speaking, an
argument j and the attack relations (k, j) and (j, h) do not appear in baf 2
because a new argument is created by tracing only a support relation in BUP.
However, it is reasonable to show the attack relation traced in the TDN, consid-
ering that the BAF is constructed based on the user’s current knowledge. Note
that these attacks do not affect the label L(h) = in.

4.5 Correctness

We now prove the validity of hybrid reasoning.


In the proof, we use the height of an argument in ubaf , as defined in [17].

Definition 9. For the acyclic universal BAF ubaf , the height of an argument
A is defined as follows:

– If A is a leaf, then the height of A is 0.


– If there are some arguments B such that (B, A) ∈→, then the height of A is
h + 1, where h is the maximum height of this B.
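Under the acyclicity assumption, Definition 9 can be computed by a simple memoized traversal; the edge encoding below (each edge (B, A) points from premise or attacker B to target A) is an assumption of this sketch.

```python
def heights(edges):
    """Definition 9 (sketch): height of each node of an acyclic ubaf.
    edges: iterable of (B, A) pairs meaning B supports or attacks A."""
    nodes = {n for e in edges for n in e}
    parents = {n: [b for b, a in edges if a == n] for n in nodes}
    memo = {}
    def h(n):
        if n not in memo:
            # a leaf has height 0; otherwise 1 + the maximum parent height
            memo[n] = 0 if not parents[n] else 1 + max(h(b) for b in parents[n])
        return memo[n]
    for n in nodes:
        h(n)
    return memo
```

On an acyclic graph the recursion always bottoms out at leaves, which is exactly the observation made below that heights are definable when ubaf is acyclic.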

Fig. 9. Top-down reasoning: Ex = {ex(f ), ex(h1), ab(j1)}.

Algorithm 5. HR: hybrid reasoning


Let Ex be a set of ex/ab arguments in an initial state.
For Ex , obtain Concl (Ex ) and a new baf by BUP.
while a user does not attain a goal that satisfies him/her, and TDN returns a
solution consistent with Ex do
For a baf and an arbitrary A on baf , obtain Sol (A) by TDN.
for each ex(A) or ab(A) in Sol(A) do
Ask the user to confirm that existence or absence.
if there exists a fact for A then
Set Ex = Ex ∪ {ex(A)}.
else
Set Ex = Ex ∪ {ab(A)}.
end if
Get Concl (Ex ) and a new baf by BUP.
end for
end while

It is easy to show that the heights of arguments are definable when ubaf is
acyclic.
Here, we prove two specifications, one for a BUP, and the other for a TDN.
For a BUP, the built BAF includes arguments pertaining to the evidential facts
that the user recognizes. Notably, the acceptability of such arguments is the
same as that of the universal BAF.

Theorem 1. Assume that ubaf is acyclic. Let baf be built by BUP from Ex .
When UEx is defined as {ex(A)|ex(A) ∈ Ex } ∪ {ab(A)|A is a leaf of ubaf ∧
ex(A) ∉ Ex }, and LU is a complete labeling for ⟨UAR ∪ UEx , UATT ∪
{(ab(A), A)|ab(A) ∈ UEx }, USUP ∪ {({ex(A)}, A)|ex(A) ∈ UEx }⟩, for any argument A ∈ UAR, either A ∈ AR ∧ L(A) = LU (A), or A ∉ AR ∧ LU (A) = out.

Proof. We prove this by induction on the height of A. When A is a leaf, if


ex(A) ∈ Ex (i.e., ex(A) ∈ UEx ), then A ∈ AR and L(A) = LU (A) = in. If
ex(A) ∉ Ex (i.e., ab(A) ∈ UEx ), then LU (A) = out. In this case, if ab(A) ∈ Ex

Fig. 10. The BAF obtained after the second round of bottom-up reasoning: baf 2 .

Fig. 11. The BAF obtained after the second round of bottom-up reasoning: baf 2 .

then A ∈ AR but L(A) = out; otherwise, A ∉ AR. Both cases satisfy the
proposition.
Assume that A is not a leaf. If LU (A) = in, then there are some supports
(A, A) ∈ USUP such that LU (A) = in, and any attacks (B, A) ∈ UATT ,
LU (B) = out. From the induction hypothesis, for any C ∈ A, C ∈ AR, and
L(C) = in; and for any attackers B of A, L(B) = out or B ∉ AR. The definition
of BUP immediately shows that A ∈ AR, and therefore L(A) = in = LU (A).
Assume that LU (A) = out. If A ∉ AR, the proposition is satisfied. Otherwise,
A ∈ AR, and from the definition of BUP, there are some supports (A, A) such
that A ⊆ AR, so A is not a leaf of baf . From LU (A) = out, there are some attacks
(B, A) ∈ UATT such that LU (B) = in, or for any supports (A, A) ∈ USUP ,
LU (A) = out (i.e., there exists C ∈ A such that LU (C) = out). From the
induction hypothesis, there are some attacks (B, A) ∈ UATT such that L(B) =
in, or for any supports (A, A) ∈ USUP , there exists C ∈ A such that C ∉ AR or
L(C) = out. For the former case, (B, A) ∈ ATT , and therefore L(A) = out. For
the latter case, for any (A, A) ∈ SUP , L(A) = out, and therefore L(A) = out.

From the above, A ∈ AR ∧ L(A) = LU (A), or A ∉ AR ∧ LU (A) = out. □




For a TDN, the facts found by PC (A) make the argument A acceptable.

Theorem 2. Assume that ubaf is acyclic and that A is an argument in UAR.


If Ex ∪ PC (A) is consistent and baf is built by BUP from Ex ∪ PC (A), then
A ∈ AR, and the complete labeling L satisfies L(A) = in. If Ex ∪ NC (A) is
consistent and baf is built by BUP from Ex ∪ NC (A), then A ∉ AR, or A ∈ AR
and the complete labeling L satisfies L(A) = out.

Proof. We prove this by induction on the height of A. For the former case, assume
that Ex ∪ PC (A) is consistent. When A is a leaf (thus of height 0), PC (A) =
{ex(A)} (i.e., baf includes A and ex(A)), and therefore L(A) = in. Otherwise,
for some A satisfying (A, A) ∈ USUP , PC (A) = ⋃(B,A)∈UATT NC (B) ∪ ⋃C∈A PC (C).
For each B such that (B, A) ∈ ATT , NC (B) ⊆ PC (A), and
Ex ∪ NC (B) is thus consistent. As the height of B is less than that of A, from
the induction hypothesis, B ∉ AR or B ∈ AR but L(B) = out. In a similar
fashion, for each C ∈ A, C ∈ AR and L(C) = in, and therefore L(A) = in.
From the definitions of BUP and complete labeling, A ∈ AR and L(A) = in.
The proof for the case of NC (A) is the same. □


5 Related Works
Support relations play important roles in our approach. Such relations can be
interpreted in several ways [12]. Cayrol et al. defined several types of indirect
attacks by combining attacks with supports, and defined several types of exten-
sions in BAF [10]. Boella et al. revised the semantics by introducing different
meta-arguments and meta-supports [6]. Noueioua et al. developed a BAF that
considered a support relation to be a “necessity” relation [18]. C̆yras et al. consid-
ered that several semantics of a BAF could be captured using assumption-based
argumentation [13]. Brewka et al. developed an abstract dialectical framework
(ADF) as a generalization of Dung’s AF [7,8]; a BAF was represented using an
ADF. These works focus on acceptance of arguments. Here, we define a support
relation and develop semantics that can represent a law.
Several authors have studied changes in AFs when arguments are added or
deleted [14]. Cayrol et al. investigated changes in acceptable arguments when an
argument was added to a current AF [11]. Baumann et al. developed a strat-
egy for AF diagnosis and repair, and explored the computational complexity
thereof [3]. Most research has focused on semantics, and changes in acceptable
sets when arguments are added/deleted. The computational complexity associ-
ated with AF updating via argument addition/deletion is a significant issue [1].
Here, we propose reasoning based on an incrementally constructed BAF,
potentially broadening the applications of such frameworks. Complexity is not
a concern here: we do not need to consider all possibilities, since solutions can be
derived from a given universal BAF. However, it is possible to use efficient com-
putational methods when executing our algorithm.

Our reasoning mechanism may be considered a form of hypothetical rea-


soning, or an abduction, which is a method used to search for the set of facts
necessary to derive an observed conclusion [19]. In assumption-based argumen-
tation, abduction is used to explain a conclusion supported by an argument [5].
Combinations of abduction and argumentation have been discussed in several
works. Kakas et al. developed a method to determine the conditions that support
arguments [16]. Sakama studied abduction in argumentation frameworks [22]
and proposed a method to search for the conditions explaining the justification
state. This may include removal of an argument if it is not justified. Also, a
computational method was developed by transforming an AF into a logic pro-
gram. In our approach, we do not remove arguments; instead, we add absence
arguments, which is equivalent to argument removal. It is reasonable to confirm
the existence or absence of evidential facts when aiming to establish whether
a certain law applies. The difference between the cited works and our method
is that, in the previous works, observations are given and the facts that can
explain those observations are searched for. In our case, potential conclusions justi-
fied by the observed facts are not specified; instead, bidirectional reasoning is
performed repeatedly to assemble a knowledge set in an incremental manner. In
addition, the purpose of our research is to support simulations. A minimal set of
facts does not necessarily yield the best solution, unlike the cases of conventional
hypothetical reasoning and common abduction.

6 Conclusion

In this paper, we developed a hybrid method featuring both bottom-up and


top-down reasoning using an incrementally constructed BAF. The method can
be applied to find a relevant law based on proven facts, and suggests facts that
might make another law applicable. The proposed method can support those
who are not familiar with a law through a simulation process, allowing a better
understanding of the law to be achieved, in addition to identifying potential
strategies for winning the case.
We are currently exploring reasoning processes that use a three-valued representation, in which undecided is one of the possible labels. In future work, we plan
to implement visualization of our method.

Acknowledgment. This work was supported by JSPS KAKENHI Grant Number


JP17H06103.

References
1. Alfano, G., Greco, S., Parisi, F.: A meta-argumentation approach for the efficient
computation of stable and preferred extensions in dynamic bipolar argumentation
frameworks. Intelligenza Artificiale 12(2), 193–211 (2018)
2. Amgoud, L., Cayrol, C., Lagasquie-Schiex, M.C., Livet, P.: On bipolarity in argu-
mentation frameworks. Int. J. Intell. Syst. 23(10), 1062–1093 (2008)

3. Baumann, R., Ulbricht, M.: If nothing is accepted - repairing argumentation frameworks. In: Proceedings of KR 2018, pp. 108–117 (2018)
4. Bench-Capon, T., Prakken, H., Sartor, G.: Argumentation in legal reasoning. In:
Simari, G., Rahwan, I. (eds.) Argumentation in Artificial Intelligence, pp. 363–382.
Springer, Boston (2009). https://doi.org/10.1007/978-0-387-98197-0 18
5. Bondarenko, A., Dung, P.M., Kowalski, R., Toni, F.: An abstract, argumentation-
theoretic approach to default reasoning. Artif. Intell. 93, 63–101 (1997)
6. Boella, G., Gabbay, D.M., van der Torre, L., Villata, S.: Support in abstract argu-
mentation. In: Proceedings of COMMA 2010, pp. 40–51 (2010)
7. Brewka, G., Woltran, S.: Abstract dialectical frameworks. In: Proceedings of KR
2010, pp. 102–111 (2010)
8. Brewka, G., Ellmauthaler, S., Strass, H., Wallner, J.P., Woltran, S.: Abstract
dialectical frameworks revisited. In: Proceedings of IJCAI 2013, pp. 803–809 (2013)
9. Caminada, M.: On the issue of reinstatement in argumentation. In: Proceedings of
JELIA 2006, pp. 111–123 (2006)
10. Cayrol, C., Lagasquie-Schiex, M.: On the acceptability of arguments in bipo-
lar argumentation frameworks. In: Proceedings of ECSQARU 2005, pp. 378–389
(2005)
11. Cayrol, C., de Saint-Cyr, F.D., Lagasquie-Schiex, M.: Change in abstract argu-
mentation frameworks: adding an argument. J. Artif. Intell. Res. 28, 49–84 (2010)
12. Cohen, A., Gottifredi, S., Garcia, A., Simari, G.: A survey of different approaches
to support in argumentation systems. Knowl. Eng. Rev. 29(5), 513–550 (2013)
13. C̆yras, K., Schulz, C., Toni, F.: Capturing bipolar argumentation in non-flat
assumption-based argumentation. In: Proceedings of PRIMA 2017, pp. 386–402
(2017)
14. Doutre, S., Jean-Guyb, M.: Constraints and changes: a survey of abstract argu-
mentation dynamics. Argum. Comput. 9(3), 223–248 (2018)
15. Dung, P.M.: On the acceptability of arguments and its fundamental role in non-
monotonic reasoning, logic programming and n-person games. Artif. Intell. 77,
321–357 (1995)
16. Kakas, A.C., Moraitis, P.: Argumentative agent deliberation, roles and context.
Electron. Notes Theor. Comput. Sci. 70, 39–53 (2002)
17. Kawasaki, T., Moriguchi, S., Takahashi, K.: Transformation from PROLEG to a
bipolar argumentation framework. In: Proceedings of SAFA 2018, pp. 36–47 (2018)
18. Nouioua, F., Risch, V.: Argumentation framework with necessities. In: Proceedings
of SUM 2011, pp. 163–176 (2011)
19. Poole, D.: Logical framework for default reasoning. Artif. Intell. 36, 27–47 (1988)
20. Prakken, H., Sartor, G.: Law and logic: a review from an argumentation perspec-
tive. Artif. Intell. 227, 214–245 (2015)
21. Rahwan, I., Simari, G. (eds.): Argumentation in Artificial Intelligence. Springer,
Boston (2009). https://doi.org/10.1007/978-0-387-98197-0
22. Sakama, C.: Abduction in argumentation frameworks. J. Appl. Non-Class. Log.
28, 218–239 (2018)
23. Satoh, K., et al.: PROLEG: an implementation of the presupposed ultimate fact
theory of Japanese civil code by PROLOG technology. In: Onada, T., Bekki,
D., McCready, E. (eds.) JSAI-isAI 2010. LNCS (LNAI), vol. 6797, pp. 153–164.
Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25655-4 14
Active Preference Elicitation by Bayesian
Updating on Optimality Polyhedra

Nadjet Bourdache(B) , Patrice Perny, and Olivier Spanjaard

Sorbonne Université, CNRS, LIP6, 75005 Paris, France


{nadjet.bourdache,patrice.perny,olivier.spanjaard}@lip6.fr

Abstract. We consider the problem of actively eliciting the prefer-


ences of a Decision Maker (DM) that may exhibit some versatility when
answering preference queries. Given a set of multicriteria alternatives
(choice set) and an aggregation function whose parameter values are
unknown, we propose a new incremental elicitation method where the
parameter space is partitioned into optimality polyhedra in the same
way as in stochastic multicriteria acceptability analysis. Each polyhedron
encompasses the subset of parameter values for which a given alternative
is optimal (one optimality polyhedron, possibly empty, per alternative
in the choice set). The uncertainty about the DM’s judgment is modeled
by a probability distribution over the polyhedra of the partition. At each
step of the elicitation procedure, the distribution is revised in a Bayesian
manner using preference queries whose choice is based on the current
solution strategy, that we adapt to minimize the expected regret of the
recommended alternative. We interleave the analysis of the set of alter-
natives with the elicitation of the parameters of the aggregation function
(weighted sum or ordered weighted average). Numerical tests have been
performed to evaluate the interest of the proposed approach.

Keywords: Incremental preference elicitation · Optimality


polyhedra · Bayesian updating · Expected regrets

1 Introduction

Preference elicitation is an essential part of computer-aided multicriteria deci-


sion support. Indeed, criteria being often conflicting, the notion of optimality is
subjective and fully depends on the Decision Maker’s (DM) view on the relative
importance attached to each criterion. Thus, the relevance of the recommenda-
tion depends on our ability to elicit this information and the way we model the
uncertainty about the DM’s preferences.
A standard way to compare feasible solutions in multicriteria decision prob-
lems is to use parameterized aggregation functions assigning a value (overall
utility) to every solution. This function can be fitted to the DM preferences
by eliciting the weighting coefficients that specify the importance of criteria in
the aggregation. In many real cases, it is impractical but also useless to precisely
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 93–106, 2019.
https://doi.org/10.1007/978-3-030-35514-2_8

specify the parameters of the aggregation function. Given a decision model, exact
choices can often be derived from a partial specification of weighting parame-
ters. Dealing with partially specified parameters requires the development of
solution methods that can determine an optimal or near optimal solution with
such partial information. This is the aim of incremental preference elicitation,
that consists on interleaving the elicitation with the exploration of the set of
alternatives to adapt the elicitation process to the considered instance and to
the DM’s answers. Thus, the elicitation effort is focused on the useful part of
the preference information. The purpose of incremental elicitation is not to learn
precisely the values of the parameters of the aggregation function but to specify
them sufficiently to be able to determine a relevant recommendation.
Incremental preference elicitation is the subject of several contributions in
various contexts, see e.g. [3,4,7,16]. Starting from the entire set of possible
parameter values, incremental elicitation methods are based on the reduction
of the uncertainty about the parameter values by iteratively asking the DM to
provide new preference information (e.g., with pairwise comparisons between
alternatives). Any new information is translated into a hard constraint that
allows to reduce the parameter space. In this way, preference data are collected
until a necessarily optimal or near optimal solution can be determined, i.e., a
solution that is optimal or near optimal for all the possible parameter values.
These methods are very efficient because they allow a fast reduction of the
parameter space. Nevertheless, they are very sensitive to possible mistakes of
the DM in her answers. Indeed, in case of a wrong answer, the definitive reduc-
tion of the parameter space will exclude the wrong part of the set of possible
parameter values, which is likely to exclude the optimal solution from the set of
possibly optimal solutions (i.e., solutions that are optimal for at least one possible
parameter value). Consequently, the relevance of the recommendation may be
significantly impacted if there is no possible backtrack. A way to overcome this
drawback is to use probabilistic approaches that allow to model the uncertainty
about the DM’s answers, and thus to give her the opportunity to contradict
herself without impacting too much the quality of the recommendation. In such
methods, the parameter space remains unchanged throughout the algorithm and
the uncertainty about the real parameter values (which characterize the DM’s
preferences) is represented by a probability density function that is updated
when new preference statements are collected.
This idea has been developed in the literature. In the context of incremental
elicitation of utility values, Chajewska et al. [8] proposed to update a proba-
bility distribution over the DM’s utility function to represent the belief about
the utility value. The probability distribution is incrementally adjusted until the
expected loss of the recommendation is sufficiently small. This method does not
apply in our setting because we consider that the utility values of the alternatives
on every criterion are known and that we elicit the values of the weighting coef-
ficients of the aggregation function. Sauré and Vielma [15] introduced a method
based on maintaining a confidence ellipsoid region using a multivariate Gaussian
distribution over the parameter space. They use mixed integer programming to

select a preference query that is the most likely to reduce the volume of the
confidence region. In a recent work [5], the uncertainty about the parameter
values is represented by a Gaussian distribution over the parameter space of
rank-dependent aggregation functions. Preference queries are selected by min-
imizing expected regrets to update the density function using Bayesian linear
regression. As the updating of a continuous density function is computationally
cumbersome (especially when analytical results for the obtention of the poste-
rior density function do not exist), data augmentation and sampling techniques
are used to approximate the posterior density function. These methods are time
consuming and require to make a tradeoff between computation time and accu-
racy of the approximation. In addition, the information provided by a continuous
density function may be much richer than the information really needed by the
algorithm to conclude. Indeed, it is generally sufficient to know that the true
parameter values belong to a given restricted area of the parameter space to
be able to identify an optimal solution without ambiguity. Thus, we introduce
in this paper a new model-based incremental elicitation algorithm based on a
discretization of the parameter space. We partition the parameter space into
optimality polyhedra and we define a probability distribution over the partition.
After each query, this distribution is updated using Bayes’ rule.
The paper is organised as follows. Section 2 recalls some background on
weighted sums and ordered weighted averages. We also introduce the optimality
polyhedra we use in our method and we discuss our contribution with regard to
related works relying on the optimality polyhedra. We present our incremental
elicitation method in Sect. 3. Finally, some numerical tests showing the interest
of the proposed approach are provided in Sect. 4.

2 Background and Notations


Let X be a set of n alternatives evaluated on p criteria. Any alternative of X
is characterized by a performance vector x = (x1 , . . . , xp ), where xi ∈ [0, U ] is
the performance of the alternative on criterion i, and U is the maximum utility
value. All utilities xi are expressed on the same scale; the utility functions must
be defined from the input data (criterion or attribute values), as proposed by,
e.g., Grabisch and Labreuche [10]. To refine the Pareto dominance relation and to
be able to better discriminate between alternatives in X , we use a parametrized
aggregation function denoted by fw . The weighting vector w of the function
defines how the components of x should be aggregated and thus makes it pos-
sible to model the decision behavior of the DM. In this paper, we consider two
operators: the weighted sum (WS) and the ordered weighted average (OWA).
We give some notations and recall some basic notions about these two aggregation
functions in the following.

Weighted Sum. Let x ∈ Rp+ be a performance vector and w ∈ Rp+ be a


weighting vector. The weighted sum is defined by:
WSw (x) = Σ_{i=1}^{p} wi xi    (1)

Ordered Weighted Average. Introduced by Yager [17], the OWA is a rank-


dependent aggregation function, where the weights are not associated to the
criteria but to the ranks in the ordered performance vector, giving more or less
importance to good or bad performances. Let x ∈ Rp+ be a performance vector
and w ∈ Rp+ be a weighting vector. The ordered weighted average is defined by:
OWAw (x) = Σ_{i=1}^{p} wi x(i)    (2)

where x(.) is a permutation of vector x such that x(1) ≤ · · · ≤ x(p) .

Example 1. Let x = (14, 9, 10), y = (10, 12, 10) and z = (9, 16, 6) be three performance vectors to compare, and assume that the weighting vector is w = (1/4, 1/2, 1/4).
Applying Eq. (2), we obtain: OWAw (x) = 10.75 > OWAw (y) = 10.5 >
OWAw (z) = 10.
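Both aggregation functions are one-liners once the performance vector is sorted; the sketch below is not from the paper, but it reproduces the values of Example 1 using exact fractions (note that the rank weights apply to the increasingly sorted vector).

```python
from fractions import Fraction

def ws(x, w):
    """Weighted sum, Eq. (1)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def owa(x, w):
    """Ordered weighted average, Eq. (2): weights attach to ranks in the
    increasingly sorted vector x_(1) <= ... <= x_(p)."""
    return sum(wi * xi for wi, xi in zip(w, sorted(x)))

w = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
assert owa((14, 9, 10), w) == Fraction(43, 4)   # 10.75
assert owa((10, 12, 10), w) == Fraction(21, 2)  # 10.5
assert owa((9, 16, 6), w) == 10
```

Using exact fractions rather than floats avoids rounding noise when comparing closely valued alternatives.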

Note that OWA includes the minimum (w1 = 1 and wi = 0, ∀i ∈ {2, . . . , p}), the
maximum (wp = 1 and wi = 0, ∀i ∈ {1, . . . , p − 1}), the arithmetic mean (wi = 1/p, ∀i ∈
{1, . . . , p}) and all other order statistics as special cases.
If w is chosen with decreasing components (i.e., the greatest weight is
assigned to the worst performance), the OWA function is concave and well-
balanced performance vectors are favoured. We indeed have, for all x ∈ X ,
OWAw ((x1 , . . . , xi − ε, . . . , xj + ε, . . . , xp )) ≥ OWAw (x) for all i, j and ε > 0
such that xi − xj ≥ ε. Depending on the choice of the weighting vector w, a
concave OWA function allows to define a wide range of mean type aggregation
operators between the minimum and the arithmetic mean. In the remainder of
the paper, we only consider concave OWA functions. For the sake of brevity, we
will say OWA for concave OWA.

Example 2. Consider vectors x, y and z defined in Example 1 and assume
that the weighting vector is now w = (1/2, 1/3, 1/6). We have: OWAw (x) = 61/6,
OWAw (y) = 62/6 and OWAw (z) = 52/6. The alternative y, which corresponds to
the most balanced performance vector, is the preferred one.

Using fw (defined with (1) or (2)) as an aggregation function, we call fw -


optimal an alternative x that maximizes fw (x). Eliciting the DM’s preferences
amounts to eliciting the weighting vector w. The rest of the section defines how
we deal with the imprecise knowledge of the parameter values in the optimization
process involved in the elicitation.

Optimality Polyhedra. We denote by W the set of all feasible weighting
vectors. Note that, to limit the scale of this set, one can add the non-restrictive
additional normalisation constraint Σ_{i=1}^{p} wi = 1. Thus, W is defined by
W = {w ∈ R^p_+ | Σ_{i=1}^{p} wi = 1 and wi ≥ 0, ∀i}. In the case of a concave OWA, the
additional constraint w1 ≥ · · · ≥ wp is enforced.

Starting from W and the set X of alternatives, we partition W into optimality
polyhedra: the optimality polyhedron associated to an alternative x is the set of
weighting vectors such that x is optimal. Note that the aggregation functions
we use are linear in w (even though OWA is not linear in x, because of the
sorting of x before applying the aggregation operation). This explains why the
sets of the partition are convex polyhedra. Any preference statement of the form
“Alternative x is preferred to alternative y” is indeed translated into a constraint
fw (x) ≥ fw (y) which is linear in w.
More formally, the optimality polyhedron Wx associated to an alternative
x ∈ X is defined by Wx = {w ∈ W | fw (x) ≥ fw (y), ∀y ∈ X }. Note that any
empty set Wx (there is no w ∈ W such that x is fw -optimal) or not full
dimensional set (i.e., ∀w ∈ Wx , ∃y ∈ X such that fw (x) = fw (y)) can be omitted.
An example of such a partition is given in Fig. 1 for the instance of Example 1,
where the aggregation function is a weighted sum. Note that w3 can be omitted
thanks to the normalization constraint (w3 = 1 − w1 − w2 ).

Fig. 1. Optimality polyhedra for x, y and z in Example 1 with WS.

In order to represent the uncertainty about the exact values of parameters,
a probability distribution is defined over the polyhedra of the partition. This
distribution is updated using an incremental elicitation approach that will be
described in the next section.
In order to represent the uncertainty about the exact values of parameters,
a probability distribution is defined over the polyhedra of the partition. This
distribution is updated using an incremental elicitation approach that will be
described in the next section.

Related Works. The idea of partitioning the parameter space is closely related
to Stochastic Multiobjective Acceptability Analysis (SMAA for short). The
SMAA methodology has been introduced by Charnetski and Soland under the
name of multiple attribute decision making with partial information [9]. Given
a set of utility vectors and a set of linear constraints characterizing the feasible
parameter space for a weighted sum (partial information elicited from the DM),
they assume that the probability of optimality for each alternative is proportional
to the hypervolume of its optimality polyhedron (the hypervolume reflects how
likely an alternative is to be optimal). Lahdelma et al. [12] developed this idea
in the case of imprecision or uncertainty in the input data (utilities of the alter-
natives according to the different criteria) by considering the criteria values as
probability distributions. They defined the acceptability index for an alternative,
that measures the variety of different valuations which allow for that alterna-
tive to be optimal, and is proportional to the expected volume of its optimality
polyhedron. They also introduced a confidence factor, that measures if the input
data is accurate enough for making an informed decision. The methodology has
been adapted to the 2-additive Choquet integral model by Angilella et al. [2].
These works consider that the uncertainty comes from the criterion values or
98 N. Bourdache et al.
from the variation in the answers provided by different DMs. They also consider
that some prior preference information is given and that there is no opportunity
to ask the DM for new preference statements. Our work differentiates from these
works in the following points:
– the criterion values are accurately known and only the parameter values of
the aggregation function must be elicited;
– the uncertainty comes from possible errors in the DM’s answers to preference
queries;
– the elicitation process is incremental.
3 Incremental Elicitation Approach


Once the parameter space W is partitioned into optimality polyhedra as explained
above, a prior density function is associated to the partition. This distribution
informs us on how likely each solution is to be optimal. In the absence of a
prior information about the DM’s preferences, we define the prior distribution
such that the probability of any polyhedron is proportional to its volume, as
suggested by Charnetski and Soland [9]. The volume of Wx gives indeed a mea-
sure on the proportion of weighting vectors for which the alternative x is ranked
vol
first. More formally, the prior probability of x to be optimal is P (x) = volWWx
where volW denotes the volume of a convex polyhedron W. We assume here a
complete ignorance of the continuous probability distribution for w within each
polyhedron. After each new preference statement, the probability distribution P
is updated using Bayes’ rule.
The choice of the next query to ask is a key point for the efficiency of the
elicitation process in acquiring enough preferential information to make a rec-
ommendation with sufficient confidence.
Query Selection Strategy. In order to get the most informative query possible,
we use a strategy based on the minimization of expected regrets. Let us
first introduce how we define expected regrets in our setting:

Definition 1. Given two alternatives x and y, and a probability distribution P
on X, the pairwise expected regret (PER) is defined by:

    PER(x, y, X, P) = Σ_{z∈X} max{0, PMR(x, y, W_z)} P(z)

where P(z) represents the probability for z to be optimal and PMR(x, y, W) is
the pairwise maximum regret over a polyhedron W, defined by:

    PMR(x, y, W) = max_{w∈W} {fw(y) − fw(x)}

In other words, the PER defines the expected worst utility loss incurred by
recommending an alternative x instead of an alternative y, and PMR(x, y, W)
is the worst utility loss in recommending alternative x instead of alternative y
Active Preference Elicitation by Bayesian Updating on Optimality Polyhedra 99

given that w belongs to W. The use of the PMR within a polyhedron is justified
by the complete ignorance about the probability distribution in the polyhedron;
therefore, the worst case is considered.

Definition 2. Given a set X of alternatives, the maximum expected regret of
x ∈ X and the minimax expected regret over X are defined by:

    MER(x, X, P) = max_{y∈X} PER(x, y, X, P)
    MMER(X, P) = min_{x∈X} MER(x, X, P)

In other words, the MER value defines the worst utility loss incurred by
recommending an alternative x ∈ X and the MMER value defines the minimal
MER value over X .
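Definitions 1 and 2 translate directly into code once the PMR values have been precomputed (by linear programming in the paper). The sketch below is illustrative; the array layout `pmr[x, y, z]` and the helper names are ours.

```python
import numpy as np

def per(x, y, pmr, P):
    """Pairwise expected regret (Definition 1):
    PER(x, y, X, P) = sum_z max(0, PMR(x, y, W_z)) * P(z)."""
    return float(np.sum(np.maximum(0.0, pmr[x, y, :]) * P))

def mer(x, pmr, P):
    """Maximum expected regret of x (Definition 2)."""
    return max(per(x, y, pmr, P) for y in range(pmr.shape[0]))

def mmer(pmr, P):
    """Index of the alternative achieving the minimax expected regret."""
    return min(range(pmr.shape[0]), key=lambda x: mer(x, pmr, P))

# Toy data: pmr[x, y, z] = PMR(x, y, W_z), assumed precomputed by LP.
pmr = np.zeros((2, 2, 2))
pmr[0, 1] = [-1.0, 2.0]
pmr[1, 0] = [1.0, -2.0]
P = np.array([0.5, 0.5])
```

On this toy instance, alternative 1 achieves the MMER and would be the current recommendation.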
The notion of regret expresses a measure of the interest of an alternative.
At any step of the algorithm, the solution achieving the MMER value is a rel-
evant recommendation because it minimizes the expected loss in the current
state of knowledge. It also allows us to determine an informative query to ask.
Various query selection strategies based on regrets and expected regrets have
indeed been introduced in the literature, see e.g. [6] in a deterministic con-
text (current solution strategy) and [11] in a probabilistic context (a probabil-
ity distribution is used to model the uncertainty about the parameter values).
Adapting the current solution strategy to our probabilistic setting, we propose
here a strategy that consists in asking the DM to compare the current recommendation
x* = arg min_{x∈X} MER(x, X, P) to its best challenger defined by
y* = arg max_{y∈X} PER(x*, y, P). The current probability distribution is then
updated according to the DM’s answer, as explained hereafter. The procedure
can be iterated until the MMER value drops below a predefined threshold ε.
The approach proposed in this paper consists in interleaving preference
queries and Bayesian updating of the probability distribution based on the DM’s
answers. The elicitation procedure is detailed in Algorithm 1. At each step i of
the algorithm, we ask the DM to compare two alternatives x(i) and y (i) . The
answer is denoted by ai , where ai = 1 if x(i) is preferred to y (i) and ai = 0 other-
wise. From each answer ai , the conditional probability P (.|a1 , . . . , ai−1 ) over the
set of alternatives is updated in a Bayesian manner (Line 13 of Algorithm 1).

Bayesian Updating. We assume that answers ai are independent binary random
variables, i.e. P(ai|x(i), y(i)) only depends on the (unknown) weighting vector
w and on the performance vectors of x(i) , y (i) . This is a standard assumption in
Bayesian analysis of binary response data [1]. To alleviate the notations, we omit
the conditioning statement in P (ai |x(i) , y (i) ), that we abbreviate by P (ai ). Using
Bayes’ rule, the posterior probability of any alternative z ∈ X is given by:

    P(z|a1, . . . , ai) = P(a1, . . . , ai|z) P(z) / P(a1, . . . , ai)
                       = P(ai|z) P(a1, . . . , ai−1|z) P(z) / (P(ai) P(a1, . . . , ai−1))   (3)
                       = P(ai|z) P(z|a1, . . . , ai−1) / P(ai)   (4)

Algorithm 1: Incremental Elicitation Procedure

Input: X: set of alternatives, ε: acceptance threshold, W: parameter space.
Output: x*: best recommendation in X
 1  P(z) ← vol(Wz)/vol(W), ∀z ∈ X
 2  i ← 0
 3  repeat
 4      i ← i + 1
 5      x(i) ← arg min_{x∈X} MER(x, X, P(.|a1, . . . , ai−1))
 6      y(i) ← arg max_{y∈X} PER(x(i), y, P(.|a1, . . . , ai−1))
 7      Ask the DM if x(i) is preferred to y(i)
 8      if the answer is yes then
 9          ai ← 1
10      else
11          ai ← 0
12      for z ∈ X do
13          Compute P(z|a1, . . . , ai) using Bayesian updating
14  until MMER(X, P(.|a1, . . . , ai)) ≤ ε
15  return x* selected in arg min_{x∈X} MER(x, X, P(.|a1, . . . , ai))

The likelihood function P(ai|z) is the conditional probability that the answer is
ai given that z is optimal. Let us denote by Wx(i)y(i) the subset of W containing
all vectors w such that fw(x(i)) ≥ fw(y(i)); the likelihood function is defined as:

    P(ai = 1|z) = δ            if Wz ⊆ Wx(i)y(i)
                = 1 − δ        if Wz ∩ Wx(i)y(i) = ∅
                = P(ai = 1)    otherwise

where δ ∈ (1/2, 1] is a constant. The corresponding update of the probability masses
follows the idea used by Nowak in noisy generalized binary search [14] and its
effect is simple; the probability masses of polyhedra that are compatible with
the preference statement are boosted relative to those that are not compatible,
while the probability masses of the other polyhedra remain unchanged. The
parameter δ controls the size of the boost, and can be seen as a lower bound on
the probability of a correct answer. The three cases are depicted in Fig. 2.
In the third case (on the right of Fig. 2), due to the assumption of complete
ignorance within a polyhedron, the new preference statement is not informative
enough to update the probability of z to be optimal. Therefore, for all alterna-
tives z such that Wz is cut by the constraint fw (x(i) ) ≥ fw (y (i) ) no updating
is performed and therefore P (ai |z) = P (ai ); consequently P (z|a1 , . . . , ai ) =
P (z|a1 , . . . , ai−1 ) by Eq. 4.
Regarding Eq. 4, note that, in practice, we do not need to determine
P (ai ). For any alternative z ∈ X such that Wz is not cut by the constraint,
we have indeed P (z|a1 , . . . , ai ) ∝ P (ai |z)P (z|a1 , . . . , ai−1 ). More precisely,
P (z|a1 , . . . , ai ) is obtained by the following equation:

Fig. 2. The polyhedron is Wz; the non-hatched area is the half-space Wx(i)y(i).
From left to right: Wz ⊆ Wx(i)y(i), Wz ∩ Wx(i)y(i) = ∅, and the remaining (cut) case.

    P(z|a1, . . . , ai) = ( Σ_{y∈Y} P(y|a1, . . . , ai−1) ) · P(ai|z) P(z|a1, . . . , ai−1) / ( Σ_{y∈Y} P(ai|y) P(y|a1, . . . , ai−1) )   (5)

where Y is the subset of alternatives whose optimality polyhedra are not cut by
the constraint. The condition Σ_{z∈X} P(z|a1, . . . , ai) = 1 obviously holds.
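The update of Eq. 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper name `bayes_update` and the `status` encoding (which polyhedron is included in, disjoint from, or cut by the half-space consistent with the answer) are ours.

```python
def bayes_update(P, status, delta=0.8):
    """One Bayesian update step in the spirit of Eq. 5.

    status[z] is 'agree' if W_z is included in the half-space consistent
    with the DM's answer, 'disagree' if it is disjoint from it, and 'cut'
    otherwise. Cut polyhedra keep their probability; the others are
    renormalized so that the total mass of the non-cut set Y is preserved.
    """
    lik = {'agree': delta, 'disagree': 1.0 - delta}
    Y = [z for z, s in enumerate(status) if s != 'cut']
    mass_Y = sum(P[z] for z in Y)                  # sum_y P(y|a_1..a_{i-1})
    norm = sum(lik[status[z]] * P[z] for z in Y)   # sum_y P(a_i|y) P(y|...)
    out = list(P)
    for z in Y:
        out[z] = mass_Y * lik[status[z]] * P[z] / norm
    return out

P_new = bayes_update([0.5, 0.3, 0.2], ['agree', 'disagree', 'cut'])
```

The consistent polyhedron is boosted, the inconsistent one is penalized, and the cut one keeps its probability of 0.2, so the total still sums to 1.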
If the optimal alternative x∗ is unique, the proposition below states that,
using Algorithm 1, the probability assigned to x∗ cannot decrease if the DM
always answers correctly.
Proposition 1. Let us denote by x∗ a uniquely optimal alternative. At any step
i of Algorithm 1, if the answer to query i is correct, then:
P (x∗ |a1 , . . . , ai ) ≥ P (x∗ |a1 , . . . , ai−1 )
Proof. Two cases can be distinguished:
Case 1. If Wx∗ ⊄ Wx(i)y(i) and Wx∗ ∩ Wx(i)y(i) ≠ ∅, then, as mentioned above,
P(x∗|a1, . . . , ai) = P(x∗|a1, . . . , ai−1) by Eq. 4 because P(ai|x∗) = P(ai).
Case 2. Otherwise, whatever the answer α of the DM, we have P (ai = α|x∗ ) = δ
because the answer to query i is correct. By Eq. 5, it follows that:

    P(x∗|a1, . . . , ai) = [ δ Σ_{y∈Y} P(y|a1, . . . , ai−1) / Σ_{y∈Y} P(ai = α|y) P(y|a1, . . . , ai−1) ] · P(x∗|a1, . . . , ai−1)

where the bracketed factor is denoted ρ.

We now show that ρ ≥ 1 for δ > 1/2. Let us denote by Y_δ the subset of alternatives
y ∈ Y such that P(ai = α|y) = δ. We have:

    Σ_{y∈Y} P(ai = α|y) P(y|a1, . . . , ai−1)
      = δ Σ_{y∈Y_δ} P(y|a1, . . . , ai−1) + (1 − δ) Σ_{y∈Y_{1−δ}} P(y|a1, . . . , ai−1)
            (because Y = Y_δ ∪ Y_{1−δ} and Y_δ ∩ Y_{1−δ} = ∅)
      ≤ δ Σ_{y∈Y_δ} P(y|a1, . . . , ai−1) + δ Σ_{y∈Y_{1−δ}} P(y|a1, . . . , ai−1)
            (because δ > 1/2; the only case of equality is when Y_{1−δ} = ∅)
      = δ Σ_{y∈Y} P(y|a1, . . . , ai−1)
Consequently, ρ ≥ 1 and thus P(x∗|a1, . . . , ai) ≥ P(x∗|a1, . . . , ai−1). □
Toward an Efficient Implementation. As mentioned above, in order to
update the probability of an alternative z, we need to know the relative position
of its optimality polyhedron Wz compared to the constraint induced by the new
preference statement fw(x(i)) ≥ fw(y(i)). For this purpose, we can consider the
Linear Programs (LPs) opt{fw(x(i)) − fw(y(i)) | w ∈ Wz}, where opt = min or max.
If the optimal values of both LPs share the same sign, then we can conclude
that the polyhedron is not cut by the constraint, otherwise it is cut. To limit
the number of LPs that need to be solved (determining the positions of all the
polyhedra would indeed require to solve 2n LPs), and thereby speed up the
Bayesian updating, we propose to approximate the polyhedra by their outer
Chebyshev balls (i.e., the smallest ball that contains the polyhedron). Let us
denote by r the radius of the Chebyshev ball and by d the distance between the
center of the ball and the hyperplane induced by the preference statement:
– if d ≥ r then the polyhedron is not cut by the constraint (see Fig. 3a). In
order to know whether the polyhedron verifies the constraint or not, we just
need to check whether the center of the ball verifies it or not. Thus, in this
case, only two scalar products are required.
– if d < r then an exact computation is required because the polyhedron can
either be cut by the constraint (Fig. 3b) or not (Fig. 3c). In this way, the use
of Chebyshev balls does not impact the results of the Bayesian updating but
only speeds up the computations.
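The two-case test above amounts to a cheap geometric prefilter before falling back to the exact LPs. A minimal sketch, assuming the center and radius of the enclosing ball are already known (the function name `ball_side` is ours):

```python
import numpy as np

def ball_side(center, radius, a, b):
    """Prefilter using a bounding ball of the polyhedron w.r.t. the
    hyperplane a·w = b induced by a preference statement.

    Returns 'above'/'below' when the ball, hence the polyhedron inside it,
    lies entirely on one side; 'unknown' means d < r, in which case the
    exact LPs must be solved.
    """
    d = abs(a @ center - b) / np.linalg.norm(a)  # center-to-hyperplane distance
    if d >= radius:
        return 'above' if a @ center >= b else 'below'
    return 'unknown'
```

Only two scalar products and a norm are needed in the conclusive cases, which is where the claimed speed-up comes from.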

Fig. 3. Example of an approximation of a polyhedron by an outer Chebyshev ball.



4 Experimental Results
Algorithm 1 has been implemented in Python using the polytope library to man-
age optimality polyhedra, and tested on randomly generated instances. We per-
formed the tests on an Intel(R) Core(TM) i7-4790 CPU with 15 GB of RAM.

Random Generation of Instances. To evaluate the performance of
Algorithm 1, we generated instances with 100 alternatives evaluated on 5 criteria,
all possibly fw-optimal (i.e., Wx ≠ ∅ ∀x ∈ X). The generation of the performance
vectors depends on the aggregation function (WS or OWA) that is considered:
– WS instances. An alternative x of the instance is generated as follows: a
  vector y of size 4 is uniformly drawn in [0, 1]^4, then x is obtained by setting
  xi = yi − yi−1 for i = 1, . . . , 5, where y0 = 0 and y5 = 1. The vectors thus
  generated all belong to the same hyperplane (because Σ_{i=1}^5 xi = 1 for all
  x ∈ X) and the set of possibly unique WS-optimal alternatives is therefore
  significantly reduced (because the optimality polyhedra of many alternatives
  are not full dimensional). To avoid this issue, as suggested by Li [13], we
  apply the square root function on all components xi for all x ∈ X in order
  to concavify the Pareto front. The set of performance vectors obtained is
  illustrated on the left of Fig. 4 in the bicriteria case.
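The WS generation scheme can be sketched as follows. We assume the uniform draws are sorted before taking the spacings (otherwise some components would be negative, which seems implicit in the description); the helper name `ws_instance` is ours.

```python
import numpy as np

def ws_instance(n=100, p=5, seed=0):
    """WS instances: spacings of sorted uniform draws (rows sum to 1),
    then a componentwise square root to concavify the Pareto front."""
    rng = np.random.default_rng(seed)
    y = np.sort(rng.uniform(size=(n, p - 1)), axis=1)
    y = np.hstack([np.zeros((n, 1)), y, np.ones((n, 1))])  # y_0 = 0, y_p = 1
    x = np.diff(y, axis=1)   # x_i = y_i - y_{i-1}
    return np.sqrt(x)

Xs = ws_instance(10, 5)
```

Before the square root, every row sums to 1 (all points on one hyperplane); after it, the front is strictly concave and all alternatives can be uniquely WS-optimal.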
– OWA instances. An alternative x is possibly OWA-optimal in a set X if
  its Lorenz curve L(x), defined by L_k(x) = Σ_{i=1}^k x_(i) (k ∈ ⟦1, 5⟧), is possibly
  WS-optimal in {L(x) : x ∈ X}. We say that a vector z is Lorenz if there
  exists a vector x such that z = L(x). Given a Lorenz vector z, we denote
  by L^{−1}(z) any vector x such that L(x) = z. For such a vector x, we have
  x_(i) = zi − zi−1 for all i = 1, . . . , 5, where z0 = 0. An alternative x of the
  instance is generated as follows: we first generate a point y in the polyhedron
  defined by the following linear constraints:

      (P)  y_{i+1} ≥ y_i                                          ∀i ∈ ⟦0, 4⟧   (1)
           (i+1)² y_{i+1} − i² y_i ≥ i² y_i − (i−1)² y_{i−1}      ∀i ∈ ⟦1, 4⟧   (2)
           Σ_{i=1}^5 i² y_i = Σ_{i=1}^5 i²                                      (3)
           y_0 = 0

  The set L = {(i² y_i)_{i∈⟦1,5⟧} : y ∈ P} contains vectors that are all Lorenz
  thanks to constraints (1) and (2). Furthermore, they belong to the same
  hyperplane due to constraint (3), and therefore they are all possibly WS-
  optimal. Consequently, all the alternatives in the set {L^{−1}(z) : z ∈ L} are
  possibly OWA-optimal. As above, to make them all possibly unique OWA-
  optimal, the square root function is applied on each component of the vectors
  z in L. The obtained set is L′ = {(i √y_i)_{i∈⟦1,5⟧} : y ∈ P}. All the vectors
  in X′ = {L^{−1}(z) : z ∈ L′} are possibly unique OWA-optimal. Finally, to
  generate an alternative x in X′, we randomly draw a convex combination
  y = Σ_{j=1}^m α_j ŷ^j of the vertices ŷ^1, . . . , ŷ^m of P. The obtained alternative is then
  defined by x = L^{−1}((i √y_i)_{i∈⟦1,5⟧}). The set of performance vectors obtained
  is illustrated on the right of Fig. 4 in the bicriteria case.
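The Lorenz transformation used above can be sketched in a few lines (the helper names `lorenz` and `lorenz_inv` are ours, standing for the paper's L and L^{−1}; `lorenz_inv` returns one representative vector among all those sharing the same Lorenz curve):

```python
import numpy as np

def lorenz(x):
    """Lorenz curve: L_k(x) = sum of the k smallest components of x."""
    return np.cumsum(np.sort(x))

def lorenz_inv(z):
    """One vector x with L(x) = z: the successive increments of z."""
    return np.diff(np.concatenate([[0.0], z]))

x = np.array([0.5, 0.1, 0.4])
z = lorenz(x)   # cumulative sums of sorted components
```

Applying `lorenz_inv` to `z` recovers the components of `x` up to permutation, which is all that matters for a symmetric aggregator such as OWA.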

Finally, for both types of instances, a hidden vector w is generated to simulate
the preferences of the DM.

Fig. 4. Example of WS (left) and OWA (right) instances with n = 20 and p = 2

Simulation of the Interactions with the DM. To simulate the DM’s answer
to query i, we represent the intensity of preference between alternatives x(i) and
y (i) by the variable u(i) = fw (x(i) ) − fw (y (i) ) + ε(i) where ε(i) ∼ N (0, σ 2 ) is
a Gaussian noise modelling the possible DM’s error, with σ determining how
wrong the DM can be. The DM states that x(i) ≻ y(i) if and only if u(i) ≥ 0.
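The noisy answer model can be sketched as follows (illustrative only; the function name `dm_answer` is ours, and fw is taken to be a weighted sum with the hidden vector w):

```python
import numpy as np

def dm_answer(x, y, w, sigma, rng):
    """Noisy DM: states x preferred to y iff f_w(x) - f_w(y) + eps >= 0,
    with eps drawn from N(0, sigma^2)."""
    u = w @ x - w @ y + rng.normal(0.0, sigma)
    return u >= 0

rng = np.random.default_rng(0)
w = np.array([0.7, 0.3])  # hidden weighting vector
# With sigma = 0 the answer always reflects the true preference.
ans = dm_answer(np.array([1.0, 0.0]), np.array([0.0, 1.0]), w, 0.0, rng)
```

Larger values of sigma make answers near indifference (small |fw(x) − fw(y)|) increasingly unreliable, which is exactly the error regime studied in the experiments.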

Analysis of the Results. We evaluated the efficiency of Algorithm 1 in terms
of the actual rank of the recommended alternative. We considered different values
for σ in order to test the tolerance to possible errors. More precisely, σ = 0 gives
an error free model while σ ∈ {0.1, 0.2, 0.3} models different rates of errors in the
answers to queries. In the considered instances, these values led to, respectively,
3.6%, 10% and 22% of wrong answers for WS and to 3.2%, 16% and 25% of wrong
answers for OWA. We set δ = 0.8, which corresponds to a prior assumption of an
error rate of 20%. Thus, the value of δ used in the experiments is uncorrelated
with the values of σ. The computation time between two queries is less than 1 s in
all cases. Results are averaged over 40 instances.
We first observed the evolution of the actual rank of the MMER alternative
over queries (actual rank according to a hidden weighting vector representing the
DM’s preferences). Figure 5 (resp. Fig. 6) shows the curves obtained for WS (resp.
OWA). We observe that the mean rank drops below 2 (out of 100 alternatives)
after about 14 queries for WS with σ < 0.3, while the same happens for OWA
whatever the value of σ. We see that, in practice, the efficiency of the approach can
be significantly impacted only when the error rate becomes greater than 20%.
We next compared the performance of Algorithm 1 with a deterministic approach
described in [4], which consists in reducing the parameter space after each
query (assuming that all answers are correct). The results are illustrated by the
boxplots in Fig. 7 for WS, and in Fig. 8 for OWA. We can see that our proba-
bilistic approach is more tolerant to errors than the deterministic approach. As
the value of σ increases, the deterministic approach makes less and less relevant
recommendations. The deterministic approach indeed recommends, in the
worst case, alternatives that are ranked around 90 while it is less than 40 for
Algorithm 1. More precisely, when σ = 0.3 (for both WS and OWA), in more
than 75% of instances, Algorithm 1 recommends an alternative with a better
rank than the mean rank obtained in the deterministic case.

Fig. 5. Mean rank vs. queries (WS)    Fig. 6. Mean rank vs. queries (OWA)

Fig. 7. Rank vs. error rate (WS)    Fig. 8. Rank vs. error rate (OWA)

5 Conclusion

We introduced in this paper a new model-based incremental multicriteria elicitation
method relying on a partition of the parameter space. The elements of
the partition are the optimality polyhedra of the different alternatives, relatively
to a weighted sum or an ordered weighted average. A probability distribution is
defined over this partition, where each probability represents the likelihood that
the true weighting vector belongs to the polyhedron. The approach is robust
to possible mistakes in the DM’s answers thanks to the incremental revision of
probabilities in a Bayesian setting. We provide numerical tests showing the effi-
ciency of the proposed algorithm in terms of number of queries, as well as the
interest of using such a probabilistic approach compared to a deterministic
approach. A short-term research direction is to investigate whether it is possible to further
speed up the Bayesian updating by using outer Löwner-John ellipsoids instead

of Chebyshev balls. The answer is not straightforward because, on the one hand,
the use of ellipsoids indeed refines the approximation of the polyhedra, but on
the other hand, this requires the use of matrix calculations to establish whether
or not an ellipsoid is cut by the constraint induced by a preference statement.
Another natural research direction is to extend our approach to more sophisti-
cated aggregation functions admitting a linear representation, such as Weighted
OWAs and other Choquet integrals, to improve our descriptive possibilities.

References
1. Albert, J.H., Chib, S.: Bayesian analysis of binary and polychotomous response
data. J. Am. Stat. Assoc. 88(422), 669–679 (1993)
2. Angilella, S., Corrente, S., Greco, S.: Stochastic multiobjective acceptability anal-
ysis for the Choquet integral preference model and the scale construction problem.
Eur. J. Oper. Res. 240(1), 172–182 (2015)
3. Benabbou, N., Perny, P.: Incremental weight elicitation for multiobjective state
space search. In: AAAI-15, pp. 1093–1099 (2015)
4. Bourdache, N., Perny, P.: Active preference elicitation based on generalized Gini
functions: application to the multiagent knapsack problem. In: AAAI 2019, pp.
7741–7748 (2019)
5. Bourdache, N., Perny, P., Spanjaard, O.: Incremental elicitation of rank-dependent
aggregation functions based on Bayesian linear regression. In: IJCAI 2019, pp.
2023–2029 (2019)
6. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based opti-
mization and utility elicitation using the minimax decision criterion. Artif. Intell.
170(8–9), 686–713 (2006)
7. Braziunas, D., Boutilier, C.: Minimax regret based elicitation of generalized addi-
tive utilities. In: Proceedings of UAI-07, pp. 25–32 (2007)
8. Chajewska, U., Koller, D., Parr, R.: Making rational decisions using adaptive utility
elicitation. In: Proceedings of AAAI-00, pp. 363–369 (2000)
9. Charnetski, J.R., Soland, R.M.: Multiple-attribute decision making with partial
information: the comparative hypervolume criterion. Nav. Res. Logist. Q. 25(2),
279–288 (1978)
10. Grabisch, M., Labreuche, C.: A decade of application of the Choquet and Sugeno
integrals in multi-criteria decision aid. Ann. OR 175(1), 247–286 (2010)
11. Guo, S., Sanner, S.: Multiattribute Bayesian preference elicitation with pairwise
comparison queries. In: NIPS, pp. 396–403 (2010)
12. Lahdelma, R., Hokkanen, J., Salminen, P.: SMAA - stochastic multiobjective
acceptability analysis. Eur. J. Oper. Res. 106(1), 137–143 (1998)
13. Li, D.: Convexification of a noninferior frontier. J. Optim. Theory Appl. 88(1),
177–196 (1996)
14. Nowak, R.: Noisy generalized binary search. In: Bengio, Y., Schuurmans, D.,
Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information
Processing Systems, vol. 22, pp. 1366–1374. Curran Associates, Inc. (2009)
15. Sauré, D., Vielma, J.P.: Ellipsoidal methods for adaptive choice-based conjoint
analysis. Oper. Res. 67, 315–338 (2019)
16. Wang, T., Boutilier, C.: Incremental utility elicitation with the minimax regret
decision criterion. IJCAI 3, 309–316 (2003)
17. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria
decision making. IEEE Trans. Syst. Man Cybern. 18(1), 183–190 (1988)
Selecting Relevant Association Rules
From Imperfect Data

Cécile L’Héritier1,2(B), Sébastien Harispe1, Abdelhak Imoussaten1,
Gilles Dusserre1, and Benoît Roig2
1 LGI2P, IMT Mines Ales, Univ Montpellier, Alès, France
{cecile.lheritier,sebastien.harispe,abdelhak.imoussaten,gilles.dusserre}@mines-ales.fr
2 EA7352 CHROME, Université de Nîmes, Nîmes, France
{cecile.lhritier,benoit.roig}@unimes.fr

Abstract. Association Rule Mining (ARM) in the context of imperfect
data (e.g. imprecise data) has received little attention so far despite
the prevalence of such data in a wide range of real-world applications.
In this work, we present an ARM approach that can be used to han-
dle imprecise data and derive imprecise rules. Based on evidence theory
and Multiple Criteria Decision Analysis, the proposed approach relies
on a selection procedure for identifying the most relevant rules while
considering information characterizing their interestingness. The several
measures of interestingness defined for comparing the rules as well as the
selection procedure are presented. We also show how a priori knowledge
about attribute values defined into domain taxonomies can be used to
(i) ease the mining process, and to (ii) help identifying relevant rules
for a domain of interest. Our approach is illustrated using a concrete
simplified case study related to humanitarian projects analysis.

Keywords: Association rules · Imperfect data · Evidence theory ·
Multiple Criteria Decision Analysis (MCDA)

1 Introduction
Association rule mining (ARM) is a well-known data mining technique designed
to extract interesting patterns in databases. It has been introduced in the context
of market basket analysis [1], and has received a lot of attention since then [15].
An association rule is usually formally defined as an implication between an
antecedent and a consequent, being conjunctions of attributes in a database, e.g.
“People who have age-group between 20 and 30 and a monthly income greater
than $2k are likely to buy product X”. Such rules are interesting for extracting
simple intelligible knowledge from a database; they can also further be used
in several applications, e.g. recommendation, customer or patient analysis. A
large literature is dedicated to the study of ARM, and numerous algorithms
have been defined for efficiently extracting rules handling a large range of data
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 107–121, 2019.
https://doi.org/10.1007/978-3-030-35514-2_9
108 C. L’Héritier et al.

types, e.g., nominal, ordinal, quantitative, sequential [15]. Nevertheless, only a
few contributions of the literature study the case of ARM with imperfect data,
e.g. [13,24], even if such data is central in numerous real-world applications.
In order to extend the body of work related to ARM with imperfect data, and
to answer some of the limitations of existing contributions, this paper presents
a novel ARM approach that can be used to handle imprecise data and derive
imprecise rules. In this study, to simplify, the proposed approach focuses on a
specific case where the antecedent and the consequent are composed of prede-
fined disjoint sets of attributes forming a partition of the whole set of attributes.
This particular case is relevant, for example in classification tasks in which the
label value to predict can be defined as consequent of the rules of interest. To
sum up, our goal is threefold: (i) to enrich the expressivity of existing proposed
frameworks, (ii) to complement them with a richer procedure for selecting rele-
vant rules, and (iii) to present simple way to incorporate domain knowledge to
ease the mining process, and to help identifying relevant rules for a domain of
interest. Based on the evidence theory framework and Multiple Criteria Decision
Analysis, a selection procedure for identifying the most relevant rules while con-
sidering information characterizing their interestingness is proposed. The several
measures of interestingness defined for comparing the rules, as well as the selec-
tion procedure, are presented. We also show how a priori knowledge in the form
of taxonomies about consequent and antecedent (i.e. attribute values) can be
used to focus on rules of interest for a domain. We also present an illustration
using a simplified case study related to humanitarian projects analysis.
The paper is structured as follows: Sect. 2 formally introduces traditional
ARM, the theoretical notions on which our approach is based, and formally
defines the problem we are considering. It also introduces related work focus-
ing on rule selection and ARM with imperfect data. The proposed approach is
detailed in Sect. 3, and Sect. 4 presents the illustration. Finally, perspectives and
concluding remarks are provided in Sect. 5.

2 Theoretical Background and Related Work
This section briefly presents some of the theoretical notions required to introduce
our work. We next provide the problem statement of ARM with imperfect data,
and our positioning w.r.t. existing contributions.

2.1 Theoretical Background

Association Rule Mining (ARM): In classical ARM [1], a database D =
{d1 , . . . , dm } to be mined consists of m observations of a set of n attributes. The
set of attribute indices is denoted by N = {1, . . . , n}. Each attribute i takes its
values in a discrete (boolean, nominal or numerical) finite scale denoted Θi. An
association rule, denoted r : X → Y, links an antecedent X with a consequent
Y, where X ∈ ∏_{i∈I} Θi, I ⊂ N, and Y ∈ ∏_{j∈J} Θj, J ⊆ N \ I.
Selecting Relevant Association Rules From Imperfect Data 109

The main challenge in ARM is to extract interesting rules from a large search
space, e.g., n and m are large. In this context, defining the interestingness of a
rule is central.

Interestingness of Rules. Numerous works have studied notions related to
the interestingness of a rule [16,22,23]. No formal and widely accepted definition
arose from those works, and discussing the numerous existing formulations
is out of the scope of this paper. However, interestingness is generally regarded
as a general concept covering several features of interest for a rule, e.g. reliabil-
ity (how reliable is the rule?) and conciseness (is the rule complex?, i.e. based
on numerous attribute-value pairs). Other aspects of a rule are also considered,
e.g. peculiarity, surprisingness, or actionability, to name a few - the reader can
refer to [12] for details. The literature also distinguishes objective and subjective
measures, the latter being defined based on domain-dependent considerations.
The two main (objective) measures used in the literature are Support and Confi-
dence [2]. The support of a rule r : X → Y denoted supp(X → Y ) is traditionally
defined as the proportion of the realization of X and Y in D, and the confidence
denoted conf (X → Y ) is defined as the proportion of the realization of Y when
X is observed in D. Given support and confidence thresholds, ARM usually aims
at identifying rules exceeding those thresholds [2]. In classical ARM, support
and confidence are quantified using probability theory framework. When ARM
involves imperfect data, this quantification requires reformulating the problem
in a theoretical framework suited for handling data imperfection. In this work,
we focus on contributions based on evidence theory.
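For reference, the two classical measures are straightforward to compute on crisp (precise) transaction data; a minimal sketch, with hypothetical helper names and toy observations of our own:

```python
def support(D, X, Y):
    """supp(X -> Y): proportion of observations containing all of X and Y."""
    items = X | Y
    return sum(items <= d for d in D) / len(D)

def confidence(D, X, Y):
    """conf(X -> Y): proportion of observations containing X that also
    contain Y."""
    n_X = sum(X <= d for d in D)
    return sum((X | Y) <= d for d in D) / n_X if n_X else 0.0

# Four toy observations over items a, b, c.
D = [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c'}, {'a'}]
```

With imprecise data, these probability-based counts no longer apply directly, which is precisely what motivates the evidential reformulation below.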

Evidence Theory has been introduced to represent imprecision and uncertainty
[21]. We briefly introduce its main concepts. Let Θ be a finite set of
elements being the most precise available information, referred to as the frame
of discernment. A mass function m : 2^Θ → [0, 1] is a set function such that
Σ_{A⊆Θ} m(A) = 1. The quantity m(A), A ⊆ Θ, is interpreted as the portion of
belief that is exactly committed to A and to nothing smaller. The subsets of Θ
having a strictly positive mass are called focal elements; their set is denoted F.
The total belief committed to any A ⊆ Θ is measured by the belief function
Bel : 2^Θ → [0, 1] with Bel(A) = Σ_{B⊆A} m(B). In evidence theory, Bel(Ā),
where Ā denotes the complement of A in Θ, is characterized through the notion
of plausibility Pl : 2^Θ → [0, 1], with Pl(A) = 1 − Bel(Ā) = Σ_{B∩A≠∅} m(B).
In order to provide a complete generalization of the probability framework,
conditioning has also been defined in evidence theory. Several expressions have
been proposed, none of them leading to a full consensus [7,10]. In this paper,
we will adopt the definition corresponding to the conditioning process stated
by Fagin et al. [10], a natural extension of the Bayesian conditioning. We do
not consider the definition proposed in Dempster [7] based on Dempster-Shafer
combination rule, where a new information is interpreted as a modification of
110 C. L’Héritier et al.

the initial belief function and used in a revision process [9]. Thus, for A, B ⊆ Θ,
such that Bel(A) > 0, we will further consider:

    Bel(B|A) = Bel(A ∩ B) / (Bel(A ∩ B) + Pl(A ∩ B̄))
    Pl(B|A) = Pl(A ∩ B) / (Pl(A ∩ B) + Bel(A ∩ B̄))
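These definitions translate directly into code when focal elements are represented as sets. The sketch below is illustrative (function names and the toy mass function are ours):

```python
def bel(m, A):
    """Bel(A): total mass of focal elements B included in A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Pl(A) = 1 - Bel(complement of A): mass of focal elements meeting A."""
    return sum(v for B, v in m.items() if B & A)

def bel_cond(m, B, A, theta):
    """Fagin-Halpern conditional belief Bel(B|A)."""
    num = bel(m, A & B)
    return num / (num + pl(m, A & (theta - B)))

def pl_cond(m, B, A, theta):
    """Fagin-Halpern conditional plausibility Pl(B|A)."""
    num = pl(m, A & B)
    return num / (num + bel(m, A & (theta - B)))

# Toy mass function on the frame {1, 2, 3}.
theta = frozenset({1, 2, 3})
m = {frozenset({1}): 0.5, frozenset({2, 3}): 0.3, theta: 0.2}
```

One can check on this example that Pl(A) = 1 − Bel(Ā) holds, as required by the duality between the two functions.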

2.2 Problem Statement and Related Work

Problem Statement. In classical ARM, only precise information is
considered, e.g., “the value of attribute i is Xi”, Xi ∈ Θi, i ∈ N. In this paper,
we consider observations as “the value of attribute i is in Ai ⊆ Θi ”. The case
Ai ⊂ Θi with |Ai | > 1 corresponds to imprecision, while Ai = Θi is considered
when information is missing, i.e. it corresponds to the ignorance about the value
of attribute i. In this setting, a rule r is defined as:

    r : A → B where A = ∏_{i∈I} Ai, Ai ⊆ Θi and B = ∏_{j∈J} Bj, Bj ⊆ Θj,
    for all I ⊂ N and J ⊆ N \ I

As mentioned previously, in this paper we consider the case where the antecedent A
concerns only a subset I1 ⊂ N of attributes and the consequent B concerns a subset
I2 ⊂ N, where I1 and I2 form a partition of N, and I1 ≠ ∅. Thus:

    r : A → B where A = ∏_{i∈I1} Ai, Ai ⊆ Θi and B = ∏_{j∈I2} Bj, Bj ⊆ Θj   (1)

We denote by R the set of rules defined by Formula (1). The problem addressed
here is to reduce R by selecting only the relevant rules.

Related Work and Positioning. As stated in the introduction, our goal is
threefold: (i) to enrich the expressivity of existing frameworks dedicated
to ARM with imperfect data, (ii) to complement them with a richer procedure
for selecting relevant rules (rule pruning), and (iii) to present a simple way to
incorporate domain knowledge to ease the mining process and to help identify
rules relevant to a domain of interest.¹

Rule Pruning. Most approaches use thresholds to select rules; in traditional
ARM, using only support and confidence most often allows drastically reducing
the number of rules [1]. A post-mining step is generally performed to rank
the remaining rules according to one specific interestingness measure, generally
selected according to the application domain and context-specific measure
properties [23,27]. Nevertheless, proceeding this way does not
¹ Note that the simplification of the mining process here refers to a reduction of
complexity in terms of the number of rules analysed, i.e. search space size. Algorithmic
contributions, and therefore complexity analyses regarding efficient implementations
of the proposed approach, are left for future work.
Selecting Relevant Association Rules From Imperfect Data 111

enable selecting rules when conflicting interestingness measures are used, e.g.
maximizing both support and specificity of rules. This is the purpose of MCDA
methods. Some works propose to take advantage of MCDA methods [3–6,17]
in the context of ARM. Those works can be divided into two categories: (1)
those incorporating the end-user’s preferences using Analytic Hierarchy Process
(AHP) and Electre II [6], or using Electre tri [3]; and (2) those that do not incor-
porate such information and use Data Envelopment Analysis (DEA) [5,26], or
Choquet integral [17]. Our approach is hybrid and falls within both categories.
First, selection is made based only on database information as in Bouker et al.
[4]. Second, if the set of selected rules is large, a trade-off based on end-user’s
preferences is used within an appropriate MCDA method. As our aim is to select
a subset of interesting rules, Electre I [18] seems to be the most appropriate.

ARM and Imperfect Data. Several frameworks have been studied to deal with
imperfect data in ARM. The assumptions entailed in the approaches based on
probabilistic models do not preserve imprecision and might lead to unreliable
inferences [13]. Uncertainty theories have also been investigated for imperfect
data in ARM using fuzzy logic [14], or using possibility theory [8]. In the case of
missing and incomplete data, evidence theory seems the appropriate setting to
handle the ARM problem [13,19,24,25]. Our approach adopts this setting. In
addition to studying a richer modelling that enables incorporating more
information, we propose to combine it with a selection process taking advantage
of an MCDA method, namely Electre I, to assess rule interestingness from
different viewpoints. Although some works previously mentioned tackle rule
selection using MCDA, and a few approaches have addressed the ARM problem
using evidence theory, none of them addresses both issues simultaneously.
We also present how to benefit from a priori knowledge about attribute values,
organised into taxonomies, to improve the rule selection process and to limit
the increase in complexity induced by the proposed extension of the modellings
used so far in existing ARM approaches suited for imperfect data.

3 Proposed Approach

This section presents our ARM approach for imperfect data. We first introduce
how rule interestingness is evaluated by presenting the selected measures and
their formalization in the evidence theory framework. Then, the main steps of
the proposed approach for selecting rules based on these measures are detailed.

3.1 Assessing Rule Interestingness from Imprecise Data

In this study, we focus on important objective measures of interestingness;
subjective ones, involving further interactions with the final user, are most often
considered context-dependent and will not be considered in this paper. We
propose to evaluate rules according to (i) their support, (ii) their confidence, as
well as (iii) indirect evaluations used to criticize their potential relevance. In addition,

since in our context rules are imprecise, and since very imprecise rules are most
often considered useless, the (iv) degree of imprecision embedded in the mined
rules is also evaluated. These four notions of interest considered in the study are
defined below. For convenience, we consider that we are computing measures to
evaluate a rule r : A → B where A = ∏_{i∈I1} A_i, A_i ⊆ Θ_i and
B = ∏_{j∈I2} B_j, B_j ⊆ Θ_j, with I1 ∪ I2 = N. In our context, since we consider
n = |N| attributes, the set functions mass m, belief Bel and plausibility Pl are
defined on subsets of Θ = ∏_{i∈N} Θ_i.

Support. A rule is said to be supported if observations of its realization are
frequent [2]. In our context, the support of a rule relates to the masses of evidence
associated with observations supporting the rule, either explicitly or implicitly.
The belief function is thus used to express support:

supp(r : A → B) = Bel(A × B)   (2)

Note that since the belief function is monotone, the rules composed of the
most imprecise attribute values will necessarily be the most supported.

Confidence. A rule is said to be reliable if the relationship described by the rule
is verified in a sufficiently great number of applicable cases [12]. The confidence
measure is traditionally evaluated as a conditional probability [1]. Its natural
counterpart in evidence theory is given by the conditional belief, leading to the
following expression:

conf(r : A → B) = Bel(B | A) = Bel(A × B) / (Bel(A × B) + Pl(A × B̄))   (3)

The elements defining the consequent are conditioned on the elements composing
the antecedent. Note that the belief and conditional belief functions have also
been adopted to express support and confidence for ARM with imprecise data
[13,24]. In those cases the modelling and domain definition were different, i.e.
restricted to the cartesian products of the power-sets of attribute domains.
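As a sketch of how Eqs. (2) and (3) operate on a product frame, the snippet below builds a mass function from a few hypothetical imprecise observations (inspired by the transport/coverage illustration that follows; the observation list is invented) and evaluates the support and confidence of a rule:

```python
from itertools import product

# Toy frames borrowed from the paper's illustration (transport, coverage).
T1 = frozenset({"truck", "motorbike", "helicopter"})
T2 = frozenset({"low", "moderate", "high"})

# Hypothetical imprecise observations (A_i, B_j); each of the n observations
# puts mass 1/n on the focal set A_i x B_j of the product frame.
obs = [
    ({"truck"}, {"high"}),
    ({"truck"}, {"moderate", "high"}),   # imprecise consequent value
    ({"truck", "motorbike"}, {"high"}),  # imprecise antecedent value
    ({"helicopter"}, {"low"}),
]
m = {}
for A, B in obs:
    F = frozenset(product(A, B))
    m[F] = m.get(F, 0.0) + 1.0 / len(obs)

def bel(S):
    return sum(v for F, v in m.items() if F <= S)

def pl(S):
    return sum(v for F, v in m.items() if F & S)

def supp(A, B):
    """Eq. (2): supp(r: A -> B) = Bel(A x B)."""
    return bel(frozenset(product(A, B)))

def conf(A, B):
    """Eq. (3): conditional belief Bel(B | A) in the product frame."""
    num = supp(A, B)
    return num / (num + pl(frozenset(product(A, T2 - frozenset(B)))))

print(supp({"truck"}, {"high"}), conf({"truck"}, {"high"}))  # 0.25 0.5
```

Only the first observation is included in {truck} × {high}, hence the support of 0.25; the second observation intersects {truck} × {low, moderate}, which is what brings the confidence down to 0.5.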

Indirect Measures of Potential Relevance. These measures will be introduced
through an illustration. Consider humanitarian projects described by
two attributes: the transport means, with Θ1 = {truck, motorbike, helicopter},
and the final coverage reached in the project (proportion of beneficiaries), with
Θ2 = {low, moderate, high}. To criticize the relevance of a rule r : A → B, e.g.
r : {truck} → {high}, we propose to evaluate the following relations:

– A → B̄. In the example, if the rule {truck} → {low, moderate} holds, it means
that using trucks most often also leads to a coverage that is not high. Hence we
consider that validating A → B̄ conveys contradictory information w.r.t.
the rule A → B and tends to invalidate it.

– Ā → B. If the rule {motorbike, helicopter} → {high} holds, it means that in
some cases some of the other means of transport also allow reaching a high
coverage. Such information tends to decrease the interest of the rule
r : A → B if we assume that B is not explained by multiple causes.
– Ā → B̄. The rule {motorbike, helicopter} → {low, moderate} means that
when trucks are not used, a low or moderate coverage (not high) is obtained.
We assume that, most commonly, if {truck} → {high} is somehow assumed
to be valid, supporting Ā → B̄ will reinforce our interest in {truck} → {high}.
In a probabilistic framework, only the relationship Ā → B̄ would have to
be studied, since the other ones do not provide additional information, i.e.
P(B̄|A) = 1 − P(B|A), P(B|Ā) = 1 − P(B̄|Ā), P(A × B̄) = P(A)P(B̄|A)
and P(Ā × B) = (1 − P(A))P(B|Ā). Thus, the potential relevance of a rule
takes into consideration the confidence of the rule composed of the complements
of the antecedent and the consequent, given by P(B̄|Ā). Note that, in the
literature, this measure is also referred to as specificity. When considering evidence
theory, the information about the complement is provided by the plausibility
function, with Bel(A) = 1 − Pl(Ā) and then Bel(B̄|Ā) = 1 − Pl(B|Ā). In this
context, Table 1 introduces the relationships between the confidence of a rule
(conditional belief) and the ones involving the complement of its antecedent
and/or consequent.
Note that to criticize the relevance of a rule using the three rules involving
its complements, we propose to consider their respective support and confidence:
criticizing a rule on the basis of weakly supported rules would not be appropriate.

Table 1. Relationships between support and confidence of a rule r : A → B and rules
involving its complements.

Rule     | Support      | Confidence                   | Depends on quantities
A → B    | Bel(A × B)   | Bel(B | A)                   | Bel(A × B) and Pl(A × B̄)
A → B̄   | Bel(A × B̄)  | Bel(B̄ | A) = 1 − Pl(B | A)  | Bel(A × B̄) and Pl(A × B)
Ā → B    | Bel(Ā × B)   | Bel(B | Ā) = 1 − Pl(B̄ | Ā)  | Bel(Ā × B) and Pl(Ā × B̄)
Ā → B̄   | Bel(Ā × B̄)  | Bel(B̄ | Ā)                  | Bel(Ā × B̄) and Pl(Ā × B)

Specificity Using Information Content. Finally, we propose to incorporate
the specificity of a rule. Consider the information "the value of attribute
i is in the subset A_i". This information is more specific than the information
"the value of attribute i is in the subset A′_i", where A_i ⊂ A′_i. Based on the
notion of Information Content (IC) defined for comparing concept specificities
in ontologies [20], we propose to quantify the specificity of a rule r by:

IC(r : A → B) = 1 − log₂ |{X : X ⊆ A × B}| / |Θ|   (4)

where |X| denotes the number of elements in a set X and Θ = ∏_{i∈N} Θ_i.
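Since |{X : X ⊆ A × B}| = 2^{|A×B|}, Eq. (4) reduces to 1 − |A × B| / |Θ|, which the following sketch computes (the two-attribute frame sizes are illustrative):

```python
import math

# Attribute frames for a two-attribute illustration (assumed sizes):
frames = [3, 3]   # |Theta_1| = |Theta_2| = 3, so |Theta| = 9

def ic(rule_sizes, frames):
    """Eq. (4): IC(r) = 1 - log2 |{X : X subset of A x B}| / |Theta|.

    Since |{X : X subset of A x B}| = 2^|A x B|, the log2 collapses to |A x B|.
    rule_sizes: sizes |A_i| and |B_j| of the value sets appearing in the rule.
    """
    size_AxB = math.prod(rule_sizes)
    size_theta = math.prod(frames)
    return 1 - size_AxB / size_theta

print(ic([1, 1], frames))  # most precise rule: 1 - 1/9
print(ic([3, 3], frames))  # fully imprecise rule: IC = 0
```

A rule built from singletons gets the highest specificity, while a rule whose antecedent and consequent cover the whole frames has IC = 0, matching the intuition that totally imprecise rules carry no information.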

3.2 Search Space Reduction

Let us recall the starting set R (see Formula (1)) of rules from which a small
subset R* of interesting rules should be selected:

R = {r : A → B | A = ∏_{i∈I1} A_i, A_i ⊆ Θ_i, B = ∏_{j∈I2} B_j, B_j ⊆ Θ_j}

We assume that I1 and I2 are fixed before starting the ARM process.
To simplify notations in the rest of the paper, we will denote by r_{A,B} the
rule r : A → B where A and B are as in Formula (1). Two restrictions are
proposed below:

1. All supported rules are generalizations (supersets) of focal elements of F,
i.e. F = {X : X ⊆ Θ, m(X) > 0}. Since support is a prerequisite for assessing
rule validity, we further consider that the evaluation will be restricted to the
set:

R_r = {r_{A,B} ∈ R | ∃X ∈ F s.t. X ⊆ A × B}
2. The search space can also be reduced using prior knowledge defined in
ontologies expressing taxonomies of attribute values. Since an ontology
defines the concepts of interest for a domain, a restriction can be performed
by considering only the attribute values defined in the taxonomies. Thus, for
each i ∈ N, only a subset O_i of 2^{Θ_i}, the information of interest for the
domain, is considered. We can then define the following restriction:

R_{r,t} = {r_{A,B} ∈ R_r | A = ∏_{i∈I1} A_i, A_i ∈ O_i, B = ∏_{j∈I2} B_j, B_j ∈ O_j}
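Restriction 1 amounts to a simple containment test against the focal elements; a minimal sketch (with invented focal sets) could look like:

```python
from itertools import product

# Hypothetical focal elements of the joint mass function (subsets of Theta).
focal = [frozenset({("truck", "high")}),
         frozenset({("truck", "moderate"), ("truck", "high")})]

def is_supported(A, B, focal):
    """Restriction 1: keep r_{A,B} only if some focal element X satisfies
    X subset of A x B, i.e. the rule has non-null support."""
    AxB = frozenset(product(A, B))
    return any(X <= AxB for X in focal)

print(is_supported({"truck"}, {"moderate", "high"}, focal))  # True
print(is_supported({"helicopter"}, {"low"}, focal))          # False
```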

Table 2. Summary of interestingness measures considered in the selection process
(∀r ∈ R_{r,t}, r : A → B).

k ∈ K | Measure          | Formula              | Variation | Weight
1     | Rule support     | supp(r) = Bel(A × B) | Maximize  | w1
2     | Rule confidence  | conf(r) = Bel(B | A) | Maximize  | w2
3     | Rule specificity | IC(r)                | Maximize  | w3
4     | A → B̄           | Bel(A × B̄)          | Minimize  | w4
5     |                  | Bel(B̄ | A)          | Minimize  | w5
6     | Ā → B            | Bel(Ā × B)           | Minimize  | w6
7     |                  | Bel(B | Ā)           | Minimize  | w7
8     | Ā → B̄           | Bel(Ā × B̄)          | Maximize  | w8
9     |                  | Bel(B̄ | Ā)          | Maximize  | w9

3.3 Rules Selection Process

The proposed approach aims at selecting the most relevant rules R* according to
their evaluations on the set of interestingness measures listed in Table 2. We here
consider that the evaluated rules are members of the restriction R_{r,t} ⊆ R, even
if that condition could further be relaxed. We denote the set of interestingness
measures by K (|K| = 9), and g_k(r) the score of rule r for the measure k ∈ K.
To simplify notations, we consider that g_k(r) is to be maximized² for all k ∈ K.
A two-step pruning strategy is proposed.

Step 1: Dominance-Based Pruning. A reduction of the concurrent rules
in R_{r,t} is carried out by focusing on non-dominated rules on the basis of the
considered measures. A rule r1 dominates a rule r2, written r2 ≺ r1, iff r1 is
at least equal to r2 on all measures and there exists a measure on which r1 is
strictly superior to r2. More formally,

r2 ≺ r1 iff g_k(r2) ≤ g_k(r1) ∀k ∈ K, and ∃j ∈ K such that g_j(r2) < g_j(r1).

The reduced set of rules can be stated as:

R_{r,t,d} = {r ∈ R_{r,t} | ∄ r′ ∈ R_{r,t} : r ≺ r′}
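The dominance filter of Step 1 is a standard Pareto-front computation; a direct quadratic sketch with illustrative scores:

```python
# Scores g_k(r) per rule, all measures oriented to be maximized (see footnote 2).
# Illustrative values only.
scores = {
    "r1": (0.5, 0.5),
    "r2": (0.6, 0.5),   # dominates r1
    "r3": (0.4, 0.9),
}

def dominates(g1, g2):
    """g1 dominates g2: at least as good everywhere, strictly better somewhere."""
    return all(a >= b for a, b in zip(g1, g2)) and any(a > b for a, b in zip(g1, g2))

def non_dominated(scores):
    """The set R_{r,t,d}: rules not dominated by any other rule."""
    return {r for r, g in scores.items()
            if not any(dominates(g2, g) for r2, g2 in scores.items() if r2 != r)}

print(sorted(non_dominated(scores)))  # ['r2', 'r3']
```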

Step 2: Pruning Using Electre I. When R_{r,t,d} remains too large to be
manually analyzed, a subjective pruning procedure based on the selection
method Electre I is applied. This MCDA method enables expressing subjectivity
through parameters that can be given by decision makers [18]. We use it to find
the final set of rules R* ⊆ R_{r,t,d}. Electre I builds an outranking relation
between pairs of rules allowing to select a subset of the best rules, R*. This
subset is such that (i) any rule excluded from R_{r,t,d} is outranked by at least
one rule from R*, and (ii) rules from R* do not outrank each other. To do so,
the Electre I procedure (a) constructs outranking relationships through pairwise
comparisons of rules, and then (b) exploits those relationships to build R*.

(a) Outranking relations: the relationship "r outranks r′" (rSr′) means that
r is at least as good as r′ on the set of measures K. The outranking assertion
rSr′ holds if: (i) a sufficient coalition of measures supports it, and (ii) none of
the measures is too strongly opposed to it. These conditions are respectively
captured by the concordance index c(rSr′) and the discordance index d(rSr′),
such that:

c(rSr′) = Σ_{k : g_k(r) ≥ g_k(r′)} w_k   and   d(rSr′) = max_{k : g_k(r) < g_k(r′)} [g_k(r′) − g_k(r)],

with w_k the relative importance of measure k.
From these notations, we consider rSr′ if c(rSr′) ≥ ĉ and d(rSr′) ≤ d̂, where ĉ
and d̂ are two thresholds defining when the outranking should be considered or not.
² Indeed, all the measures used in our approach take values in the interval [0, 1];
a measure k to be minimized can thus be changed into a measure to be maximized
by considering 1 − g_k(r) instead of g_k(r).

(b) Relations exploitation: a graph of outranking relationships is obtained
from these pairwise comparisons. The kernel of this graph is our final reduced
set of rules R* to be considered, such that:

∀r′ ∈ R_{r,t,d} \ R*, ∃ r ∈ R* : rSr′,  and  ∀(r, r′) ∈ R* × R*, ¬(rSr′).   (5)
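Steps (a) and (b) can be sketched as follows; the scores, weights and thresholds are illustrative, and the kernel extraction is a greedy procedure that is valid when the outranking graph is acyclic:

```python
# Concordance/discordance-based outranking (step a) and a kernel extraction
# (step b), with invented scores for three rules on two measures.
scores = {"r1": (0.9, 0.8), "r2": (0.5, 0.4), "r3": (0.6, 0.9)}
w = (0.5, 0.5)
c_hat, d_hat = 0.7, 0.3

def outranks(r, rp):
    """r S r' iff concordance >= c_hat and discordance <= d_hat."""
    g, gp = scores[r], scores[rp]
    c = sum(wk for gk, gpk, wk in zip(g, gp, w) if gk >= gpk)
    d = max((gpk - gk for gk, gpk in zip(g, gp) if gk < gpk), default=0.0)
    return c >= c_hat and d <= d_hat

def kernel(rules):
    """Rules not outranked by any remaining rule enter R*; everything they
    outrank is discarded; iterate on the rest (valid for acyclic graphs)."""
    remaining, K = set(rules), set()
    while remaining:
        top = {r for r in remaining
               if not any(outranks(r2, r) for r2 in remaining if r2 != r)}
        if not top:   # cyclic outranking graph: this greedy sketch gives up
            break
        K |= top
        remaining -= top | {r for r in remaining
                            if any(outranks(k, r) for k in top)}
    return K

print(sorted(kernel(scores)))  # ['r1', 'r3']: only r2 is outranked
```

Here r1 and r3 do not outrank each other (each falls short of ĉ against the other), while both outrank r2, so the kernel satisfies the two conditions of (5).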

The set of model parameters that have to be defined for applying the subjective
reduction based on Electre I are the weights w_k, ∀k ∈ K, and the concordance
and discordance thresholds ĉ and d̂.³ The choice of parameter values will be
further discussed in the illustration, Sect. 4.

4 Illustration

As an illustration, we consider the context of humanitarian projects carried out
in response to emergency situations. A dataset of observations describes these
emergency situations according to four attributes: (1) the type of disaster faced,
(2) the season, (3) the environment in which it occurred, and (4) an evaluation
of the situation w.r.t. the human cost. We further refer to these attributes using

Table 3. Database of observations expressed using precise, imprecise or missing values.

Disaster type Season Environment Human cost


d1 {earthquake} {autumn} {rural} {medium}
d2 {tsunami} {autumn} {urban} {medium}
d3 {epidemic} - {urban} {veryHigh}
d4 {earthquake, epidemic, tsunami} {spring} - {high, veryHigh}
d5 {epidemic} {spring} {urban} {high}
d6 {epidemic} {spring, summer} - {high, veryHigh}
d7 {epidemic} {spring, summer} {urban} {high, veryHigh}
d8 {epidemic} {spring, summer} {urban} {veryHigh}
d9 {earthquake, epidemic, tsunami} {summer} {rural} {high}
d10 {epidemic} {summer} {urban} {high}
d11 {epidemic} {summer} {urban} {veryHigh}
d12 {earthquake} {winter} {rural} {high, medium, veryHigh}
d13 {earthquake} {winter} {rural} {low}
d14 {earthquake, epidemic, tsunami} {winter} {rural} {high}

³ Evaluating the support and confidence of Ā → B and Ā → B̄ can lead to undefined
values: e.g. when evaluating Ā → B, we have Bel(Ā × B) = 0 when Ā has never been
observed, leading to Bel(B | Ā) being undefined. However, pruning using dominance
and Electre I requires the same measures to be defined for all rules. Undefined values
are thus substituted by an arbitrary value that neither favors nor penalizes the
evaluation of the rule Ā → B. The median of Bel(Ā × B) (resp. Bel(Ā × B̄)) has
been chosen. Note that A → B̄ is not concerned, since evaluating A → B implies
evidence on A.

their number, considering that they respectively take discrete values in: Θ1 =
{tsunami, earthquake, epidemic, conflict, pop.displacement}, Θ2 = {spring, sum-
mer, autumn, winter }, Θ3 = {urban, rural }, Θ4 = {low, medium, high, very-
High}. Besides, for each attribute, prior knowledge is defined into ontologies
determining the values of interest. In this specific case study, the purpose of
association rules is to highlight the influence of a situation's contextual features
on its evaluation according to the human cost, a useful piece of information for
project planning. Thus the searched rules r : A → B will involve the attributes
of I1 = {1, 2, 3} in the antecedent and of I2 = {4} in the consequent.

Table 4. Set of non-dominated rules, Rr,t,d .

Disaster Type Season Environment Human cost

r0 : {earthquake} ∧ {autumn} ∧ {rural} → {medium}


r1 : {earthquake, tsunami} ∧ {autumn} ∧ Θ3 → {medium}
r2 : {tsunami} ∧ {autumn} ∧ {urban} → {medium}
r3 : {earthquake, epidemic, tsunami} ∧ Θ2 ∧ Θ3 → Θ4
r4 : {earthquake, epidemic, tsunami} ∧ Θ2 ∧ Θ3 → {high, medium, veryHigh}
r5 : {earthquake, epidemic, tsunami} ∧ Θ2 ∧ Θ3 → {high, veryHigh}
r6 : {epidemic} ∧ Θ2 ∧ Θ3 → {high, veryHigh}
r7 : {epidemic} ∧ Θ2 ∧ {urban} → {veryHigh}
r8 : {earthquake} ∧ {autumn, winter} ∧ {rural} → {medium}
r9 : {earthquake, tsunami} ∧ {autumn, winter} ∧ Θ3 → {low, medium}
r10 : {earthquake, tsunami} ∧ {autumn, winter} ∧ Θ3 → {medium}
r11 : {earthquake, epidemic, tsunami} ∧ {spring, summer} ∧ Θ3 → {high, veryHigh}
r12 : {epidemic} ∧ {spring, summer} ∧ Θ3 → {high, veryHigh}
r13 : {epidemic} ∧ {spring, summer} ∧ {urban} → {high, veryHigh}
r14 : {epidemic} ∧ {spring, summer} ∧ {urban} → {veryHigh}
r15 : {epidemic} ∧ {summer} ∧ {urban} → {high, veryHigh}
r16 : {epidemic} ∧ {summer} ∧ {urban} → {veryHigh}
r17 : {earthquake} ∧ {winter} ∧ {rural} → {low}

Among the observations of 14 projects given in Table 3, some attribute values
are expressed with imprecision, e.g. human cost values may be unclear, such
as "human cost is high or veryHigh". When values are missing, total
ignorance is considered. In this setting, the size of the initial studied space R
is ∏_{i=1}^{4} |2^{Θ_i} \ {∅}| = 20925. Using the restrictions focusing on rules
with non-null support, and involving attribute values of interest defined in
ontologies (cf. Sect. 3), we obtain a reduced search space R_{r,t} composed of
484 rules.
The rule evaluation and selection process is further applied to Rr,t using the
9 interestingness measures proposed in Table 2. Using dominance-based pruning
(Step 1/2), a set of 18 non-dominated rules Rr,t,d is identified among the 484
rules initially considered. These rules are listed in Table 4, and indexed from r0
to r17 . Pruning using Electre I is then applied over the set of non-dominated
rules Rr,t,d (Step 2/2). Different sets of selected rules -i.e. R∗ - are given in
Table 5 for different sets of model parameters. The results being sensitive to
parameter values, we propose to discuss different parameter settings. We remind

that these parameters are: the weights w_k of the interestingness measures,
∀k ∈ K, and the concordance and discordance thresholds ĉ and d̂. They
represent the end-user's preferences. They can be given directly; the weights w_k
can also be elicited using the Simos procedure, a well-known weighting
method [11].

Table 5. Final sets of rules (R*) obtained with Electre I pruning using five parameter
settings (a to e), all with ĉ = 0.7.

   w1   w2   w3   w4   w5   w6   w7   w8    w9    d̂    R*
a  0.27 0.15 0.1  0.08 0.08 0.08 0.08 0.08  0.08  0.3  {r1, r3, r6, r9, r11}
b  0.18 0.18 0.18 0.1  0.1  0.1  0.1  0.03  0.03  0.3  {r1, r3, r6}
                                             0.2  {r0, r1, r2, r3, r6, r13, r16, r17}
c  0.12 0.2  0.2  0.08 0.08 0.08 0.08 0.08  0.08  0.3  {r1, r3, r6}
                                             0.2  {r0, r1, r2, r3, r6, r13, r16, r17}
d  0.15 0.25 0.25 0.05 0.05 0.05 0.05 0.075 0.075 0.3  {r1, r3, r6, r17}
                                             0.2  {r0, r1, r2, r3, r6, r13, r16, r17}
e  0.33 0.33 0.34 0    0    0    0    0     0     0.3  R_{r,t,d} \ {r8, r10, r16, r17}

Among the considered interestingness measures, according to the literature,
we assume that support, confidence and IC are the most significant ones w.r.t.
rule interest. They have to be associated with the most important weights.
Conversely, we assume that the other measures, about rule complements, are
secondary and provide additional information for comparing and criticizing the
relevance of rules. In the first parameter setting (a) (cf. Table 5), the weight given
to support and confidence is maximized to represent 60% of the votes required
for the outranking (to exceed ĉ = 0.7). This setting tends to favor the rules
having a high degree of imprecision, being well supported and hence reliable,
since Bel(B|A) ≥ Bel(A × B). For example, in this setting the rules r3, r6, r11
(see Tables 4 and 5) are among the selected rules in R*; e.g. r3 involves total
imprecision on three attributes.

Restricting d̂ to 0.2 with the parameter settings (b), (c), (d) increases the size
of the kernel, while still discarding more than half of the non-dominated rules.
With parameters (d) and d̂ = 0.3, the highest importance is given to confidence
and IC, providing these two measures with 71% of the voting power needed to
reach the outranking condition ĉ = 0.7. Thus, a rule with a better score on
confidence, IC and some of the other measures, except support, can be selected
while having a low support. This is illustrated by the selection of r17, for
example. Lastly, the parameter setting (e) is equivalent to considering only the
three main measures with equal importance. Here, it discards only 4 extra rules
in comparison to the dominance relationships, which is explained by the fact
that the absence of dominance between rules is more frequent.
Finally, the parameter settings (b), (c) or (d) with d̂ = 0.2, favoring
support, confidence and IC over the other measures, tend to provide interesting
results. This setting enables the selection of both precise and imprecise rules of
interest w.r.t. the initial set of observations, such as r16 and r13. In the initial
dataset (see Table 3), the imprecise information {spring, summer} for the season
or {high, veryHigh} for the human cost is frequently observed. Hence,
selecting the imprecise rule r13 : {epidemic} ∧ {spring, summer} ∧ {urban} →
{high, veryHigh} in R* is not surprising. As an interpretation of this rule,
we can say that the analysis of the database tends to relate the occurrence of
epidemics in urban areas to a specific season, spring or summer, and to the
human cost. In particular, the rule seems valid for at least one of the
conjunctions "summer and high human cost", "summer and veryHigh human
cost", "spring and high" or "spring and veryHigh". In this illustration, different
sets of parameters and their effects on rule selection have been presented.
However, these parameters have to be set by the end-user.
To further discuss these results, it is interesting to note that all the selected
measures for rule comparison, except the IC, are based on observation
frequency. In order to counterbalance the preponderance of this factor, it might
be relevant to add subjective measures and not only data-driven ones. Subjective
interestingness measures have been studied in the literature; relying on
these works, we could include measures based, for example, on user-expected
rules or expected conjunctions of attribute values. Furthermore, investigating
the dependencies among frequency-based measures, and considering them in the
selection process, would be valuable. Nevertheless, considering additional
measures (especially data-driven ones), such as those proposed for classical ARM,
is not necessarily straightforward within the evidence theory framework: it
implies defining their proper expression and meaning in this framework.

5 Conclusion and Perspectives

Mining association rules from imperfect data is a key challenge for real-world
applications dealing with, e.g., imprecise or missing data. The ARM approach
introduced in this paper enables dealing with imprecise data and deriving
imprecise rules under specific conditions (e.g. fixing both antecedent and
consequent attributes). Relying on evidence theory and Multiple Criteria Decision
Analysis, this new framework enriches the expressivity of existing works while
providing a novel selection procedure for identifying the most interesting rules
according to several viewpoints. To this aim, several interestingness measures
have been proposed and used in a two-step selection procedure based on
dominance relationships and Electre I. A restriction using a priori knowledge has
also been proposed to focus and ease the mining process by incorporating
symbolic knowledge defined in domain ontologies. To further improve the approach,
additional measures of interestingness could be added. Future work related to
subjective (e.g., user-oriented) measures would be particularly relevant to enrich
the set of frequency-based measures currently involved in the approach. Studying
the interactions between the measures would also be of interest. Finally, only
an illustration using a simplified case study related to humanitarian project
analysis has been presented in this paper. Thorough algorithmic complexity and

performance evaluations of the approach remain to be conducted. Difficult
challenges related to algorithmic complexity and efficiency issues of the
procedure also have to be addressed in order to mine rules involving numerous
attributes.

References
1. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of
items in large databases. In: ACM SIGMOD Record, vol. 22, pp. 207–216. ACM
(1993)
2. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In:
Proceedings of 20th International Conference on Very Large Data Bases, VLDB,
vol. 1215, pp. 487–499 (1994)
3. Ait-Mlouk, A., Gharnati, F., Agouti, T.: Multi-agent-based modeling for extract-
ing relevant association rules using a multi-criteria analysis approach. Vietnam J.
Comput. Sci. 3(4), 235–245 (2016)
4. Bouker, S., Saidi, R., Yahia, S.B., Nguifo, E.M.: Ranking and selecting association
rules based on dominance relationship. In: 2012 IEEE 24th International Confer-
ence on Tools with Artificial Intelligence, vol. 1, pp. 658–665. IEEE (2012)
5. Chen, M.C.: Ranking discovered rules from data mining with multiple criteria by
data envelopment analysis. Expert Syst. Appl. 33(4), 1110–1116 (2007)
6. Choi, D.H., Ahn, B.S., Kim, S.H.: Prioritization of association rules in data mining:
multiple criteria decision approach. Expert Syst. Appl. 29(4), 867–878 (2005)
7. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
Ann. Math. Stat. 38, 325–339 (1967)
8. Djouadi, Y., Redaoui, S., Amroun, K.: Mining association rules under imprecision
and vagueness: towards a possibilistic approach. In: 2007 IEEE International Fuzzy
Systems Conference, pp. 1–6. IEEE (2007)
9. Dubois, D., Denoeux, T.: Conditioning in Dempster-Shafer theory: prediction vs.
revision. In: Denoeux, T., Masson, M.H. (eds.) Belief Functions: Theory and Appli-
cations, pp. 385–392. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-
642-29461-7_45
10. Fagin, R., Halpern, J.Y.: A new approach to updating beliefs. In: Proceedings of
the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI 1990,
pp. 347–374. Elsevier Science Inc., New York, NY, USA (1991). http://dl.acm.org/
citation.cfm?id=647233.760137
11. Figueira, J., Roy, B.: Determining the weights of criteria in the Electre type
methods with a revised Simos' procedure. Eur. J. Oper. Res. 139(2), 317–326 (2002)
12. Geng, L., Hamilton, H.J.: Interestingness measures for data mining: a survey. ACM
Comput. Surv. 38(3), 9-es (2006)
13. Hewawasam, K., Premaratne, K., Subasingha, S., Shyu, M.L.: Rule mining and
classification in imperfect databases. In: 2005 7th International Conference on
Information Fusion, vol. 1, p. 8. IEEE (2005)
14. Hong, T.P., Lin, K.Y., Wang, S.L.: Fuzzy data mining for interesting generalized
association rules. Fuzzy Sets Syst. 138(2), 255–269 (2003)
15. Kotsiantis, S., Kanellopoulos, D.: Association rules mining: a recent overview.
GESTS Int. Trans. Comput. Sci. Eng. 32(1), 71–82 (2006)
16. Liu, B., Hsu, W., Chen, S., Ma, Y.: Analyzing the subjective interestingness of
association rules. IEEE Intell. Syst. 15(5), 47–55 (2000). https://doi.org/10.1109/
5254.889106
Selecting Relevant Association Rules From Imperfect Data 121

17. Nguyen Le, T.T., Huynh, H.X., Guillet, F.: Finding the most interesting association
rules by aggregating objective interestingness measures. In: Richards, D., Kang, B.-
H. (eds.) PKAW 2008. LNCS (LNAI), vol. 5465, pp. 40–49. Springer, Heidelberg
(2009). https://doi.org/10.1007/978-3-642-01715-5_4
18. Roy, B.: Classement et choix en présence de points de vue multiples. Revue
française d’informatique et de recherche opérationnelle 2(8), 57–75 (1968)
19. Samet, A., Lefèvre, E., Yahia, S.B.: Evidential data mining: precise support and
confidence. J. Intell. Inf. Syst. 47(1), 135–163 (2016)
20. Seco, N., Veale, T., Hayes, J.: An intrinsic information content metric for semantic
similarity in WordNet. In: ECAI, vol. 16, p. 1089 (2004)
21. Shafer, G.: A Mathematical Theory of Evidence, vol. 42. Princeton University
Press, Princeton (1976)
22. Silberschatz, A., Tuzhilin, A.: What makes patterns interesting in knowledge dis-
covery systems. IEEE Trans. Knowl. Data Eng. 8(6), 970–974 (1996)
23. Tan, P.N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for
association patterns. In: Proceedings of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 32–41. ACM (2002)
24. Tobji, M.B., Yaghlane, B.B., Mellouli, K.: A new algorithm for mining frequent
itemsets from evidential databases. Proc. IPMU 8, 1535–1542 (2008)
25. Bach Tobji, M.A., Ben Yaghlane, B., Mellouli, K.: Frequent itemset mining from
databases including one evidential attribute. In: Greco, S., Lukasiewicz, T. (eds.)
SUM 2008. LNCS (LNAI), vol. 5291, pp. 19–32. Springer, Heidelberg (2008).
https://doi.org/10.1007/978-3-540-87993-0_4
26. Toloo, M., Sohrabi, B., Nalchigar, S.: A new method for ranking discovered rules
from data mining by DEA. Expert Syst. Appl. 36(4), 8503–8508 (2009)
27. Vaillant, B., Lenca, P., Lallich, S.: A clustering of interestingness measures. In:
Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 290–297.
Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30214-8_23
Evidential Classification of Incomplete
Data via Imprecise Relabelling:
Application to Plastic Sorting

Lucie Jacquin1(B), Abdelhak Imoussaten1(B), François Trousset1(B),
Jacky Montmain1(B), and Didier Perrin2(B)
1 LGI2P, IMT Mines Ales, Univ Montpellier, Ales, France
{lucie.jacquin,abdelhak.imoussaten,francois.trousset,jacky.montmain}@mines-ales.fr
2 C2MA, IMT Mines Ales, Univ Montpellier, Ales, France
[email protected]

Abstract. Besides ecological issues, the recycling of plastics involves
economic incentives that encourage industrial firms to invest in the field.
Some of them have focused on the waste sorting phase by designing
optical devices able to discriminate on-line between plastic categories.
To achieve both ecological and economic objectives, sorting errors must
be minimized to avoid serious recycling problems and significant qual-
ity degradation of the final recycled product. Even with the most recent
acquisition technologies based on spectral imaging, plastic recognition
remains a tough task due to the presence of imprecision and uncertainty,
e.g. variability in measurement due to atmospheric disturbances, age-
ing of plastics, black or dark-coloured materials etc. The enhancement
of recent sorting techniques based on classification algorithms has led
to quite good performance results; however, the remaining errors have
serious consequences for such applications. In this article, we propose
an imprecise classification algorithm to minimize the sorting errors of
standard classifiers when dealing with incomplete data, by both integrat-
ing the processing of classification doubt and hesitation in the decision
process and improving the classification performances. To this end, we
propose a relabelling procedure that enables better representation of the
imprecision of the learning data, and we introduce the belief functions
framework to represent the posterior probability provided by a classifier.
Finally, the performance of our approach compared to existing imprecise
classifiers is illustrated on the sorting problem of four plastic categories
from mid-wavelength infra-red spectra acquired in an industrial context.

Keywords: Machine learning · Imprecise classification · Reliable
classification · Belief functions · Plastic separation

1 Introduction
Plastic recycling is a promising alternative to landfills for dealing with the
fastest growing waste stream in the world [8]. However, for physiochemical rea-
sons related to non-miscibility between plastics, most plastics must be recycled
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 122–135, 2019.
https://doi.org/10.1007/978-3-030-35514-2_10

separately. Plastic category identification is therefore a major challenge in the


recycling process. With the emergence of hyperspectral imaging, some indus-
trial firms have designed sorting devices able to discriminate between several
categories of plastics based on their absorption or transmittance spectra. The
sorting process is generally performed using supervised classification, which has
been well developed with the emergence of computer sciences and data science
[18,22,38]. The classification performance might be affected by several issues
such as noise or overlapping regions in the feature space [21,34]. The latter
problem occurs when samples from different classes share very similar char-
acteristics. We are particularly faced with these problems when attempting to
classify industrially acquired spectra. Indeed, in an industrial context, the acqui-
sition process is subject to technical and financial constraints to ensure through-
put and financial competitiveness. For this reason one cannot expect the same
quality of data as for equivalent laboratory measures. Several issues imply the
presence of imprecision and uncertainty in the acquired spectra: (i) the avail-
able spectral range might be insufficient; (ii) the plastic categories to be recycled
are chemically close; (iii) atmospheric perturbations may cause noise; (iv) plastic
ageing and plastic additives are known to change spectral information; (v) impu-
rities like dust deposits or remains of tags will also produce spectral noise. As
in solving many other decision-making problems, classification errors may have
serious consequences, e.g., medical diagnosis applications. Regarding plastic sort-
ing, identification errors will cause serious recycling difficulties and significant
degradation of the secondary raw material performances and thus quality degra-
dation of the recycled products. Usually, the problem of plastic identification
is treated using standard classification algorithms that are designed to produce
point predictions, i.e., a single plastic category. In cases of imperfect data, stan-
dard classifiers become confused and inevitably commit errors. This brings us
to consider alternative representations of the information that take into account
imprecision and uncertainty to achieve more accurate classification. Modern the-
ories of uncertainty such as fuzzy subsets [35], possibility theory [14], imprecise
probabilities [33] or belief functions [26,30] offer better representations of the
data-imperfection of information. Several classification algorithms have been
proposed in these frameworks. Most of them are extensions of standard algo-
rithms. We can cite the fuzzy version of the well known k-means algorithm [15],
fuzzy and evidential versions of k-Nearest Neighbour (k-NN) [10,19] or some
fuzzy and evidential revisions of neural network algorithms [4,11].
In this paper we consider the case where the original imperfections come from
data features only. Available training example labels are precise and considered
trustworthy, e.g., based on laboratory measures and expertise. In order to bet-
ter represent all available information, we think that labels should conform with
the feature imprecision. If an object of class θ1 has its vector of features x in
the overlapping region between θ1 and θ2, then the example should be relabelled with the
set {θ1, θ2}. In order to achieve such a representation, we propose to relabel each
training example in accordance with its discriminatory nature. New labels are
therefore subsets of the original set of classes. This imprecise relabelling would

better represent the learning data by mapping overlaps in the feature space. The
resulting imprecise label can be naturally treated in the belief functions theory
context. Indeed, belief functions theory [26] is an interesting framework for rep-
resenting imprecise and uncertain data by allowing the allocation of a probability
mass to imprecise data. Thus, imprecision and ignorance are better captured in
this framework compared to the probability framework where equiprobability
and imprecision are confused. The recent growing interest in this theory has
allowed techniques to be developed for resolving a diverse range of problems
such as estimation [12,17], standard classification [10,32], or even hierarchical
classification [1,23].
Our proposed approach, called Evidential CLAssification of incomplete data
via Imprecise Relabelling (ECLAIR), is based on a relabelling procedure of the
training examples that enables better representation of the missing information
about some data features. Then a classifier is trained on the relabelled data
producing a posterior mass function. With imprecise relabelling we try to quantify,
using a mass function, the extent to which a subset of classes is reliable
and relevant as an output for a new data point. In other words, we look for the set of
classes such that any more precise subset output would inevitably lead to an error.
The resulting classification algorithm can enhance the classification accuracy as
well as cope with difficult examples by allowing less precise but more reliable
classification output which will optimize the recycling process.
The remainder of this paper is organized as follows: Sect. 2 sets down the
main notations and provides a reminder on supervised classification and elements
of belief functions theory; in Sect. 3 we present the proposed approach; Sect. 4
briefly describes the related works; Sect. 5 presents results of experimentation
on the sorting problem of four plastics.

2 Theoretical Background
Classification is a technique for assigning objects to categories based on
observations of several of their characteristics. A classifier is a function that maps
an object represented by its values of characteristics on a finite set of variables,
to a category represented by a value of a categorical variable. More precisely, let
us consider a set of n categories represented by a set Θ = {θ1, θ2, . . . , θn}, also
referred to as a set of labels or classes. In the framework of belief functions, Θ is
called a frame of discernment. Each θj , j ∈ {1, ..., n} denotes a singleton which
represents the lowest level of discernible information in Θ. Let us denote by
X1, X2, . . . , Xp the p variables whose values represent the characteristics,
also called attributes or features, of the objects to be classified. In the rest of
the paper we refer to Θ as a set of classes and to (X1 , X2 , . . . , Xp ) as a vector
of features where ∀i ∈ {1, . . . , p}, Xi refers both to the name of the feature and
to the space of the values taken by the feature, i.e., Xi ⊆ R. For an object x
belonging to X = X1 × · · · × Xp ⊆ R^p, let θ(x) ∈ Θ denote the unknown label that
should be associated with x.

In this article, we focus on a supervised classification problem. The specificity


of the considered data, referred to as incomplete data, is that some features of
some examples are missing due to technological aspects. Therefore, only part of
the data of these examples is obtained. The proposed classification approach,
qualified as imprecise, integrates the incompleteness of the data in its process to
predict subsets of classes comprising the true class in cases where a standard
classifier would have predicted a wrong class. To this aim, we divert standard
probabilistic classifiers from their natural use in order to compute probabilities on
sets of classes. The resulting uncertain information is then captured by belief
functions. The following subsections briefly recall the notions discussed.

2.1 Supervised Classification


To determine θ(x) in a supervised classification manner, a standard classifier
δΘ : X → Θ is trained on a set of examples (xi , θi )1≤i≤N such that for all
1 ≤ i ≤ N , xi belongs to X and θi to Θ. By standard classifier we mean a classi-
fier that assigns to x a single class θ(x) = θj , j ∈ {1, . . . , n}. In some cases when
the input data is too voluminous or redundant, it may be appropriate to perform
feature extraction before training δΘ. By reducing the dimension
of X, and thus working with a reduced feature space X′ ⊆ R^{p′} with p′ < p,
extraction methods such as Principal Component Analysis (PCA), Linear Discriminant
Analysis (LDA) or Independent Component Analysis (ICA) facilitate the
learning and may enhance the classification performance. When feature extraction
is designed taking into account the labels of the training examples, it is
termed supervised feature extraction. For instance, LDA, also known as Fisher
discriminant analysis, reduces the number of features to n − 1 by looking for a
linear combination of the variables maximizing the between-groups variance and
minimizing the within-groups variance.

2.2 Probabilistic Classifier and Decision Rule


When δΘ can also provide for x a posterior probability distribution p(.|x) : Θ → [0, 1],
it is called a probabilistic classifier. Many classification algorithms base their decision
only on p(.|x) as follows: θ(x) = arg max_{j=1,...,n} p(θj|x). For more sophisticated
decisions, one can use the decision rule technique classically used in decision theory.
Let A = {a1, a2, . . . , am} be a finite set of actions that can be taken. In the
case of a standard classifier, an action a ∈ A corresponds to assigning a class θ ∈ Θ to
an object x. In such a case, we simplify by setting A = Θ. In order to compare decisions
in A, or to compare the classifier δΘ to another decision rule, two functions
are introduced: a loss function and a risk function. A loss function L : A × Θ → R
quantifies the loss L(a, θ) incurred when choosing the action a ∈ A
while the true state of nature is θ ∈ Θ. A risk function rδΘ : A → R is defined as the
expectation rδΘ(a) = E_{p(.|x)}(L(a, θ)). In the case of a discrete and finite
probability distribution, we have rδΘ(θj) = Σ_{k=1}^{n} L(θj, θk) p(θk|x), j ∈ {1, . . . , n}.
Thus, considering the decision rule δΘ, the class θj minimizing the risk rδΘ(θj) over
Θ should be chosen.
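To make the risk-minimizing rule concrete, here is a small illustrative sketch in Python (the posterior values, class names and loss are invented for the example; the paper's own experiments use R):

```python
# Sketch of the risk-minimizing decision rule: pick the action a that
# minimizes r(a) = sum_k L(a, theta_k) * p(theta_k | x).

def bayes_decision(posterior, loss):
    """posterior: dict class -> p(class | x); loss: dict (a, theta) -> L(a, theta)."""
    def risk(a):
        return sum(loss[(a, t)] * p for t, p in posterior.items())
    return min(posterior, key=risk)

classes = ["A", "B", "C"]
posterior = {"A": 0.2, "B": 0.5, "C": 0.3}
# With the 0-1 loss, minimizing the risk reduces to picking the most
# probable class.
zero_one = {(a, t): 0.0 if a == t else 1.0 for a in classes for t in classes}
print(bayes_decision(posterior, zero_one))  # B
```

With an asymmetric loss (e.g., penalizing some confusions more than others), the same rule can select a class other than the posterior mode.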

2.3 Elements of Belief Functions Theory

Due to the additivity constraint inherent to the definition of a probability distribution,
one cannot build a probability distribution when measures, observations,
etc. are imprecise. Belief functions theory, as an extension of probability the-
ory, allows masses to be assigned to imprecise data. Two levels are considered
when introducing belief functions: credal and pignistic levels. At the credal level,
beliefs are captured and quantified by belief functions, while at the pignistic level
or decision level, beliefs are quantified using probability distributions.

Credal Level. A mass function, also called a basic belief assignment (bba), is
a set function m : 2^Θ → [0, 1] satisfying Σ_{A⊆Θ} m(A) = 1. For a set A ⊆ Θ,
the quantity m(A) is interpreted as a measure of evidence committed exactly
to the set A and not to any more specific subset of A. The elements A ∈ 2^Θ
such that m(A) > 0 are called focal elements and they form a set denoted F.
(m, F) is called a body of evidence. The total belief committed to A is measured
by the sum of the masses of all subsets of A. This is expressed by the belief function
Bel : 2^Θ → [0, 1], Bel(A) = Σ_{B⊆A} m(B). Furthermore, the plausibility of A,
Pl : 2^Θ → [0, 1], quantifies the maximum amount of support that could be
allocated to A: Pl(A) = Σ_{B⊆Θ, B∩A≠∅} m(B).

Pignistic Level. In the transferable belief model [29], the decision is made at
the pignistic level. The evidential information is transferred into a probabilistic
framework by means of the pignistic probability distribution betPm: for θ ∈ Θ,
betPm(θ) = Σ_{A⊆Θ, θ∈A} m(A)/|A|, where |A| denotes the number of elements in A.
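These definitions translate directly into code. The following is an illustrative Python sketch with invented mass values (subsets of Θ are encoded as frozensets; this is not the authors' implementation):

```python
# Toy sketch of the credal-level and pignistic-level quantities.

def bel(m, A):
    """Total belief committed to A: sum of the masses of subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Plausibility of A: sum of the masses of sets intersecting A."""
    return sum(v for B, v in m.items() if B & A)

def betP(m, theta):
    """Pignistic probability of the singleton theta."""
    return sum(v / len(B) for B, v in m.items() if theta in B)

m = {frozenset({"a"}): 0.5,
     frozenset({"a", "b"}): 0.3,
     frozenset({"a", "b", "c"}): 0.2}
A = frozenset({"a", "b"})
print(bel(m, A), pl(m, A))     # 0.8 1.0
print(round(betP(m, "a"), 4))  # 0.5 + 0.3/2 + 0.2/3 = 0.7167
```

Note that Bel(A) ≤ betP-probability of A ≤ Pl(A), and that betP sums to 1 over the singletons, as expected from the definitions above.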

Decision Rule. The risk associated with a decision rule is adaptable to the
evidential framework [9,13,27]. In the case of imprecise data, the set of actions
A is 2^Θ \ {∅}. In order to decide between the elements of A according to the
chosen loss function L, it is possible to adopt different strategies. Two strategies
are proposed in the literature: the optimistic strategy, minimizing the lower risk r̲,
and the pessimistic strategy, minimizing the upper risk r̄, defined as follows:

r̲(A) = Σ_{B⊆Θ} m(B) min_{θ∈B} L(A, θ),    r̄(A) = Σ_{B⊆Θ} m(B) max_{θ∈B} L(A, θ).    (1)
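As an illustration, the two risks of Eq. (1) can be computed as follows (a Python sketch; the mass values and the set-membership 0-1 loss are our own choices for the example, not prescribed by the paper):

```python
# Sketch of the optimistic (lower) and pessimistic (upper) risks of Eq. (1)
# for an imprecise action A, given a mass function m over frozensets.

def lower_risk(m, loss, A):
    # Optimistic: take the minimum loss over the states theta in each focal set B.
    return sum(v * min(loss(A, t) for t in B) for B, v in m.items())

def upper_risk(m, loss, A):
    # Pessimistic: take the maximum loss over the states theta in each focal set B.
    return sum(v * max(loss(A, t) for t in B) for B, v in m.items())

# Set-membership 0-1 loss: no loss if the true class belongs to the prediction.
loss = lambda A, t: 0.0 if t in A else 1.0

m = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
A = frozenset({"b"})
print(lower_risk(m, loss, A), upper_risk(m, loss, A))  # 0.6 1.0
```

The gap between the two values reflects the ambiguity carried by the non-singleton focal set {a, b}.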

3 Problem Statement and Proposed Approach


3.1 Imprecise Supervised Classification

For a new example x, the output of an imprecise classifier is a set of classes,
all of whose elements are candidates for the true class θ; the missing information
prevents a more precise output. In this case, a possible output of the classifier is
the information “θ ∈ A”, A ⊆ Θ. To perform an imprecise classification, two
cases need to be distinguished related to the training examples: (case 1 ) learn-
ing examples are precisely labelled, i.e., only a single class is assigned to each
example; (case 2 ) one or more classes are assigned to each training example. In
the first case described in the Subsect. 2.1, standard classifiers give a single class
as prediction to a new object x but some recent classifiers [6,7,36] give a set of
classes as prediction of x. Some of these recent classifiers base their algorithm
on the posterior probability provided by standard classifiers. More precisely, if
we denote by P(.|x) the probability measure associated with the posterior
probability distribution p(.|x), then P(A|x) = Σ_{θ∈A} p(θ|x), A ⊆ Θ, is used to determine
the relevant subset of classes to be assigned to x. In the second case, when the
imprecision or doubt is explicitly expressed by the labels [2,5,37], a classifier
δ2Θ : X → 2^Θ \ {∅} is trained on a set of examples (xi, Ai)1≤i≤N such that for
all 1 ≤ i ≤ N, xi belongs to X and ∅ ≠ Ai ⊆ Θ. This case is referred to in our
paper as imprecise supervised classification.
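The set probability used by the case-1 classifiers above is simply additive over the elements of the set; a minimal sketch (the posterior values are invented):

```python
# P(A | x) is obtained by summing the posterior probabilities of the
# elements of A.
def set_probability(posterior, A):
    return sum(posterior[t] for t in A)

posterior = {"A": 0.5, "B": 0.3, "C": 0.2}
print(set_probability(posterior, {"A", "B"}))  # 0.8
```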

3.2 Problem Statement

Let us consider the supervised classification problem where the available training
examples that are precisely labelled (case 1 ) (xi , θi )1≤i≤N , xi ∈ X and θi ∈ Θ are
such that (i) the labels θi=1,...,N are trusted. They may derive from expertise on
other features x∗i=1,...,N which contain more complete information than xi=1,...,N ,
(ii) this loss of information induces overlapping on some examples. In other
words, ∃i, j ∈ {1, ..., N } such that the characteristics of xi are very close to
those of xj but θi ≠ θj. When a standard classifier is trained on such data, it
will commit inevitable errors. The problem that we handle in this paper is how
to improve the learning step to better consider this type of data and get better
performances and reliable predictions.

3.3 The Imprecise Classification Approach

The proposed approach of imprecise classification consists of three steps:
(i) the relabelling step, which analyses each training example in order to add, to the
class initially associated with it, the classes associated with other examples
whose characteristics are very close. Thus a new
set of examples is built: (xi, Ai)1≤i≤N such that for all 1 ≤ i ≤ N, xi belongs
to X and ∅ ≠ Ai ⊆ Θ; (ii) the training step, which consists of training a
probabilistic classifier δ2Θ : X → 2^Θ \ {∅}. The classifier δ2Θ provides for a new

object x ∈ X a posterior probability distribution on 2Θ which is also a mass


function denoted m(.|x). The trained classifier ignores the existence of inclusion
or intersection between subsets of classes. This unawareness of relations between
the labels may seem counter-intuitive, but it is compatible with the purpose of
finding a potentially imprecise label associated with a new incoming example; (iii)
the decision step, which relies on a loss function adapted to the
case of imprecise classification and computes the prediction that minimizes the
risk function associated with the classifier δ2Θ. Figure 1 illustrates the global process
and the steps of relabelling, classification and decision are presented in detail
below.

Fig. 1. Steps of evidential classification of incomplete data

Relabelling Procedure. First we perform LDA extraction on the training
examples (cf. Fig. 1) in order to reduce complexity. The resulting features are
x′i ∈ R^{n−1}, i = 1, ..., N, where n = |Θ|. Then we consider a set of C standard
classifiers δΘ^1, ..., δΘ^C; for each classifier δΘ^c : R^{n−1} → Θ, c ∈ {1, ..., C}, we
compute leave-one-out (LOO) cross-validation predictions for the training data
(x′i, θi)i=1,...,N.
The relabelling of the example (xi, θi) is based on a vote over the
LOO predictions of the C classifiers. The vote procedure is the following: when
a majority of more than 50% of the classifiers predict a class θmaji, the example is
relabelled with the union Ai = {θi, θmaji}. Note that when θmaji = θi, the original
label remains, i.e., Ai = {θi}. If none of the predicted classes from the C classifiers
gets a majority, then ignorance is expressed for this example by relabelling
it as Ai = Θ. Note that the new labels are consistent with the original classes

that were trusted. The fact that several (C) classifiers are used to express the
imprecision provides better objectivity regarding the real imprecision of the features,
i.e., the example is difficult not only for a single classifier. We denote by A ⊆ 2^Θ
the set of the new training labels Ai , i = 1, ..., N .
Note that we limited the new labels Ai to at most two elements, except
when expressing ignorance (Ai = Θ). This is done to avoid overly unbalanced
training sets, but more general relabellings could be considered. Once all the
training examples are relabelled, a classifier δ2Θ can be trained.
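The vote procedure above can be sketched as follows (illustrative Python; the class names and vote counts are invented, and the real procedure votes over LOO predictions of C trained classifiers):

```python
from collections import Counter

def relabel(original_label, loo_predictions, frame):
    """Vote over the C leave-one-out predictions: a strict-majority class
    theta_maj yields the label {theta_i, theta_maj}; otherwise the whole
    frame Theta is returned to express ignorance."""
    votes = Counter(loo_predictions)
    maj, count = votes.most_common(1)[0]
    if count > len(loo_predictions) / 2:
        return frozenset({original_label, maj})
    return frozenset(frame)

frame = ["A", "B", "C", "D"]
# Majority disagrees with the original label: doubt {A, B} is recorded.
print(sorted(relabel("A", ["B", "B", "B", "B", "A", "C", "B"], frame)))  # ['A', 'B']
# Majority agrees with the original label: the precise label remains.
print(sorted(relabel("A", ["A", "A", "A", "A", "A", "C", "B"], frame)))  # ['A']
# No strict majority: ignorance, i.e., the whole frame.
print(sorted(relabel("A", ["B", "B", "C", "C", "D", "D", "A"], frame)))  # ['A', 'B', 'C', 'D']
```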
Learning δ2Θ. As indicated throughout this paper, δ2Θ is learnt using the new
labels, ignoring the relations that might exist between the elements of A. Reinforcing
the idea of independent treatment of the classes, LDA is applied
to the relabelled training set (xi, Ai)i=1,...,N. This results in a reduction of the
space dimension from p to |A| − 1, which better expresses the repartition of the
relabelled training examples. For each training example i ∈ {1, ..., N}, let x″i ∈ R^{|A|−1}
be the new projection of xi onto this (|A| − 1)-dimensional space. The classifier δ2Θ is
finally trained on (x″i, Ai)i=1,...,N.
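A small bookkeeping sketch of the resulting dimension (the labels below are invented; the point is that each distinct new label, a subset of Θ, is treated as an opaque class, so the second LDA projection lives in a space of dimension |A| − 1):

```python
# Each distinct relabelled class counts once in the set A of new labels.
new_labels = [frozenset({"A"}), frozenset({"A", "B"}), frozenset({"B"}),
              frozenset({"A"}), frozenset({"A", "B", "C", "D"})]
A_set = set(new_labels)       # the set A of distinct labels actually used
lda_dim = len(A_set) - 1      # dimension of the reduced feature space
print(len(A_set), lda_dim)    # 4 3
```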

Decision Problem. As recalled in Subsects. 2.2 and 2.3, the decision to assign a
new object x to a single class or a set of classes usually relies on the minimisation
of the risk function which is associated to a loss function L : 2Θ \ {∅} × Θ → R.
As mentioned in the introduction to this paper, the application of our work
concerns situations where errors may have serious consequences. It is then
legitimate to consider the pessimistic strategy, minimizing the upper risk r̄. Furthermore,
in the definition of r̄, Eq. (1), the quantity max_{θ∈B} L(A, θ) is the loss
incurred by choosing A ⊆ Θ when the true class lies in B ⊆ Θ. On
the basis of this fact, we propose a new definition of the loss function, L(A, B),
A, B ⊆ Θ, which directly takes into account the relations between A and B.
This is actually a generalisation of the definition proposed in [7] that is based
on F-measure, recall and precision for imprecise classification. Let us consider
A, B ∈ 2Θ \ {∅}, where A = θ(x) is the prediction for the object x and B is its
state of nature. Recall is defined as the proportion of relevant classes included
in the prediction θ(x). We define the recall of A and B as:

R(A, B) = |A ∩ B| / |B|.    (2)

Precision is defined as the proportion of classes in the prediction that are rele-
vant. We define the precision of A and B as:

P(A, B) = |A ∩ B| / |A|.    (3)

Considering these two definitions, the F-measure can be defined as follows:

Fβ(A, B) = (1 + β²) P R / (β² P + R) = (1 + β²) |A ∩ B| / (β² |B| + |A|).    (4)

Note that β = 0 induces Fβ(A, B) = P(A, B), whereas Fβ(A, B) → R(A, B)
as β → ∞. Let us comment on some situations according to the
“true set” B and the predicted set A. The worst prediction scenario is when
there is no intersection between A and B. This would always be sanctioned by
Fβ (A, B) = 0. On the contrary, when A = B, Fβ (A, B) = 1 for every β. Between
those extreme cases, the errors of generalisation i.e., B ⊂ A, are controlled by
the precision while the errors of specialisation i.e., A ⊂ B, are controlled by the
recall. Finally, the loss function Lβ : 2Θ \ {∅} × 2Θ \ {∅} → R is extended:
Lβ (A, B) = 1 − Fβ (A, B). (5)
For an example x to be classified, whose mass function m(.|x) has been calculated
by δ2Θ , we predict the set A minimizing the following risk function:

Riskβ(A) = Σ_{B⊆Θ} m(B) Lβ(A, B).    (6)
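The decision step can be sketched by exhaustively scoring the non-empty subsets of Θ (an illustrative Python sketch with invented mass values; for a four-class problem there are 2⁴ − 1 = 15 candidate sets):

```python
from itertools import combinations

def f_beta(A, B, beta):
    """F-measure of Eq. (4) between the prediction A and the 'true set' B."""
    inter = len(A & B)
    return (1 + beta**2) * inter / (beta**2 * len(B) + len(A))

def best_set(m, beta, classes):
    """Return the non-empty subset A minimizing Risk_beta of Eq. (6)."""
    subsets = [frozenset(c) for r in range(1, len(classes) + 1)
               for c in combinations(classes, r)]
    risk = lambda A: sum(v * (1.0 - f_beta(A, B, beta)) for B, v in m.items())
    return min(subsets, key=risk)

m = {frozenset({"A"}): 0.45, frozenset({"B"}): 0.35, frozenset({"A", "B"}): 0.2}
classes = ["A", "B", "C", "D"]
print(sorted(best_set(m, 0.25, classes)))  # ['A']       (precision-oriented)
print(sorted(best_set(m, 4.0, classes)))   # ['A', 'B']  (recall-oriented)
```

A small β favours precise (small) predictions, while a large β favours cautious sets, matching the role of β discussed with the experiments.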

4 Related Works
Regarding relabelling procedures, much research has been carried out to identify
suspect examples with the intention of suppressing them or relabelling them with a
more appropriate concurrent class [16,20]. This is generally done to enhance performance.
Other approaches consist in relabelling into imprecise classes. This has been done
to test the evidential classification approach on imprecisely labelled data in [37]. But,
as already stated, our relabelling serves a different purpose, better mapping over-
laps in the feature space. Concerning the imprecise classification, several works
have been dedicated to tackle this problem. Instead of the term “imprecise clas-
sification” that is adopted in our article, authors use terms like “nondeterminis-
tic classification” [7], “reliable classification” [24], “indeterminate classification”
[6,36], “set-valued classification” [28,31] or “conformal prediction” [3] (see [24] for
a short state of the art). In [36], the Naive Credal Classifier (NCC) is proposed as
the extension of Naive Bayes Classifier (NBC) to sets of probability distributions.
In [24] the authors propose an approach that starts from the outputs of a binary
classification [25] using classifiers that are trained to distinguish aleatoric and
epistemic uncertainty. The outputs are epistemic uncertainty, aleatoric uncertainty
and two preference degrees in favor of the two concurrent classes. [24] generalizes
this approach to the multi-class setting, providing sets of classes as output. Closer to
our approach are the approaches of [5] and [7]. The approach in [7] is based on a posterior
probability distribution provided by a probabilistic classifier. The advantage of
such an approach and ours is that any standard probabilistic classifier may be used to
perform an imprecise classification. Our approach distinguishes itself by the relabelling
step and by the way probabilities are allowed on sets of classes. To the best
of our knowledge, existing algorithms do not train a probabilistic classifier
on partially labelled data to quantify the body of evidence. Although we insisted
on the use of a standard probabilistic classifier δ2Θ unaware of relations between the
sets, it is possible to run our procedure with an evidential classifier such as the
evidential k-NN [5].

5 Illustration
5.1 Settings
We performed experiments on the classification problem of four plastic categories
designated plastics A, B, C and D on the basis of industrially acquired spectra.
The total of 11540 available data examples is summarized in Table 1. Each plastic
example was identified by experts on the basis of laboratory measures of attenuated
total reflectance (ATR) spectra, which are considered a reliable source
of information for determining the plastic category. As a consequence, the original
training classes are trusted and were not questioned. However, data provided by
the industrial devices may be challenged. These data consist of spectra composed
of the reflectance intensities of 256 different wavelengths. Therefore, and for the
reasons enumerated in Sect. 1, the features are subject to ambiguity. Prior to
experiments, all the feature vectors, i.e., spectra, were corrected by the standard
normal variate technique to avoid light scattering and spectral noise effects. We
implemented our approach and compared it to the approaches in [5] and [7]. The
implementation is made using R packages, using existing functions for the appli-
cation of the following 8 classifiers naive Bayes classifier: (nbayes), k-Nearest
Neighbour (k-NN), decision tree (tree), random forest (rf), linear discriminant
analysis (lda), partial least squares discriminant analysis (pls-da), support vector
machine (svm) and neural networks (nnet).1

Table 1. Number of spectra of each original class in the learning and testing bases.

Classes        Category A   Category B   Category C   Category D
Learning base  1416         1412         1425         1434
Testing base   1469         1458         1454         1472

5.2 Results
In order to apply our procedure, we must first choose a set of classifiers
to perform the relabelling. These classifiers are not necessarily probabilistic,
but must produce point predictions. Thus, for every experiment, our algorithm
ECLAIR was performed with the ensemble relabelling using 7 classifiers: nbayes,
k-NN, tree, rf, lda, svm, nnet2. Then, we are able to perform the ECLAIR impre-
cise version of a selected probabilistic classifier. Figure 2 shows the recall and
precision scores of the probabilistic classifier nbayes, illustrating the role of β. We see
the same influence of β as mentioned in [7]. Indeed (cf. Subsect. 3.3), with small
1 Experiments concerning these learning algorithms rely on the following functions
(and R packages): naiveBayes (e1071), knn3 (caret), rpart (rpart), randomForest
(randomForest), lda (MASS), plsda (caret), svm (e1071), nnet (nnet).
2 In order to limit unbalanced classes, we chose to exclude from the learning base
examples whose new labels count fewer than 20 examples.

Fig. 2. Recall and precision of ECLAIR using nbayes, i.e., δ2Θ is nbayes, against β.

values of β we have good precision, reflecting the relevance of the prediction, i.e.,
the size of the predicted set is reasonable; while high values of β give good recall,
meaning reliability, i.e., a better chance of having the true class included in the
predictions. The choice of β should then result from a compromise between relevance
and reliability requirements.

Table 2. Precision P of ECLAIR compared with the nondeterministic classifier, with βs chosen such that recalls equal 0.90.

                  nbayes  k-NN   tree   rf     lda    pls-da  svm    evidential k-NN
Nondeterministic  86.70   86.94  85.00  86.52  83.41  85.35   88.20  86.58
ECLAIR            87.78   87.89  83.88  87.45  82.94  86.33   88.31  86.69

In order to evaluate the performance of ECLAIR, we compared our results
to the classifier proposed in [7], called here the nondeterministic classifier.
As both the nondeterministic classifier and ECLAIR are set up with a parameter β, we
decided to set the βs such that the global recalls equal 0.90, and compare global
precisions on a fair basis. For even more neutrality regarding the features used
in both approaches, we supply the nondeterministic classifier with the same reduced
features as those used by ECLAIR in the training phase (see
Fig. 1). The first 7 columns of Table 2 show the precisions thus obtained for 7
classifiers. These results show the competitiveness of our approach for most of
the classifiers, especially nbayes, k-NN, rf and pls-da. However, these results are
only partial since they do not show the general trend over different βs, which is
generally in favour of our approach. Therefore we present more complete results
for nbayes and svm in Fig. 3, plotting the precision score against the recall
score for several values of β varying in [0, 6]. On the same figure, we also present
the results of the nondeterministic classifier with different input features (in black):
raw features, i.e., features in R^p, LDA-reduced features, i.e., features in R^{n−1}, and the
same features as those used for ECLAIR, i.e., features in R^{|A|−1} (see Fig. 1 for more
details). Doing so, we show that the good performances of ECLAIR are not only

Fig. 3. Precision vs. recall of Nondeterministic (ND) and ECLAIR, for svm (left panel) and nbayes (right panel); ND is shown with raw data, LDA extraction, and ECLAIR features.

attributable to the extraction phase. To facilitate the understanding of the results
plotted in Fig. 3, note that the best performances are those
illustrated by points in the top right of the plots, i.e., higher precision and recall
scores. We observe that ECLAIR generally makes a better compromise between
the recall and precision scores for the classifiers used. Regarding the special case
where ECLAIR is performed with an evidential classifier trained on imprecisely
labelled examples (see Sect. 4), the comparison is less straightforward.
We considered the evidential k-NN [10] for imprecise labels, minimizing the
error suggested in [39]. Using this evidential k-NN as the classifier δ2Θ in the ECLAIR
procedure is straightforward. Concerning the application of the nondeterministic
classifier, we decided to keep the same parameters and turn the classifier into a
probabilistic one by applying the pignistic transformation to the mass output of the
k-NN classifier (see the last column of Table 2). ECLAIR obtains slightly better results.

6 Conclusion

In this article, a method for the evidential classification of incomplete data via
imprecise relabelling was proposed. For any probabilistic classifier, our approach
provides an adaptation yielding more cautious outputs. The benefit of our approach
was illustrated on the problem of sorting plastics, where it showed competitive
performance. Our algorithm is generic: it can be applied in any other context
where incomplete data on the features are present. In future work, we plan to
exploit our procedure to provide cautious decision-making for the problem of
plastic sorting. This application requires highly reliable decisions in order to
preserve the physicochemical properties of the recycled product. At the same
time, the decisions must remain sufficiently relevant to guarantee the financial
interest: indeed, the more finely a plastic category is sorted, the greater the
benefit for the industrial operator. We also plan to strengthen the evaluation of
our approach by comparing it with other state-of-the-art imprecise classifiers
and by performing experiments on several datasets from machine learning
repositories.
134 L. Jacquin et al.

References
1. Alshamaa, D., Chehade, F.M., Honeine, P.: A hierarchical classification method
using belief functions. Signal Process. 148, 68–77 (2018)
2. Ambroise, C., Denoeux, T., Govaert, G., Smets, P.: Learning from an imprecise
teacher: probabilistic and evidential approaches. Appl. Stoch. Models Data Anal.
1, 100–105 (2001)
3. Balasubramanian, V., Ho, S.S., Vovk, V.: Conformal Prediction for Reliable
Machine Learning: Theory, Adaptations and Applications. Newnes, Oxford (2014)
4. Buckley, J.J., Hayashi, Y.: Fuzzy neural networks: a survey. Fuzzy Sets Syst. 66(1),
1–13 (1994)
5. Côme, E., Oukhellou, L., Denoeux, T., Aknin, P.: Learning from partially super-
vised data using mixture models and belief functions. Pattern Recogn. 42(3), 334–
348 (2009)
6. Corani, G., Zaffalon, M.: Learning reliable classifiers from small or incomplete data
sets: the naive credal classifier 2. J. Mach. Learn. Res. 9(Apr), 581–621 (2008)
7. Coz, J.J.D., Dı́ez, J., Bahamonde, A.: Learning nondeterministic classifiers. J.
Mach. Learn. Res. 10(Oct), 2273–2293 (2009)
8. Cucchiella, F., D’Adamo, I., Koh, S.L., Rosa, P.: Recycling of WEEEs: an economic
assessment of present and future e-waste streams. Renew. Sustain. Energy Rev.
51, 263–272 (2015)
9. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
In: Yager, R.R., Liu, L. (eds.) Classic Works of the Dempster-Shafer Theory of
Belief Functions. Studies in Fuzziness and Soft Computing, vol. 219. Springer,
Heidelberg (2008). https://doi.org/10.1007/978-3-540-44792-4 3
10. Denoeux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer
theory. IEEE Trans. Syst. Man Cybern. 25(5), 804–813 (1995)
11. Denoeux, T.: A neural network classifier based on Dempster-Shafer theory. IEEE
Trans. Syst. Man Cybern. Part A Syst. Hum. 30(2), 131–150 (2000)
12. Denoeux, T.: Maximum likelihood estimation from uncertain data in the belief
function framework. IEEE Trans. Knowl. Data Eng. 25(1), 119–130 (2013)
13. Denoeux, T.: Logistic regression, neural networks and Dempster-Shafer theory: a
new perspective. Knowl.-Based Syst. 176, 54–67 (2019)
14. Dubois, D., Prade, H.: Possibility theory. In: Meyers, R. (ed.) Computational Com-
plexity. Springer, New York (2012). https://doi.org/10.1007/978-1-4614-1800-9
15. Dunn, J.C.: A fuzzy relative of the ISODATA process and its use in detecting
compact well-separated clusters. J. Cybern. 3, 32–57 (1973)
16. Kanj, S., Abdallah, F., Denoeux, T., Tout, K.: Editing training data for multi-label
classification with the k-nearest neighbor rule. Pattern Anal. Appl. 19(1), 145–161
(2016)
17. Kanjanatarakul, O., Kaewsompong, N., Sriboonchitta, S., Denoeux, T.: Estimation
and prediction using belief functions: Application to stochastic frontier analysis.
In: Huynh, V.N., Kreinovich, V., Sriboonchitta, S., Suriya, K. (eds.) Econometrics
of Risk. Studies in Computational Intelligence, vol. 583. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-13449-9 12
18. Kassouf, A., Maalouly, J., Rutledge, D.N., Chebib, H., Ducruet, V.: Rapid
discrimination of plastic packaging materials using MIR spectroscopy coupled with
independent components analysis (ICA). Waste Manage. 34(11), 2131–2138 (2014)
19. Keller, J.M., Gray, M.R., Givens, J.A.: A fuzzy k-nearest neighbor algorithm. IEEE
Trans. Syst. Man Cybern. 4, 580–585 (1985)

20. Lallich, S., Muhlenbach, F., Zighed, D.A.: Improving classification by removing
or relabeling mislabeled instances. In: Hacid, M.-S., Raś, Z.W., Zighed, D.A.,
Kodratoff, Y. (eds.) ISMIS 2002. LNCS (LNAI), vol. 2366, pp. 5–15. Springer,
Heidelberg (2002). https://doi.org/10.1007/3-540-48050-1 3
21. Lee, H.K., Kim, S.B.: An overlap-sensitive margin classifier for imbalanced and
overlapping data. Expert Syst. Appl. 98, 72–83 (2018)
22. Leitner, R., Mairer, H., Kercek, A.: Real-time classification of polymers with NIR
spectral imaging and blob analysis. Real-Time Imaging 9(4), 245–251 (2003)
23. Naeini, M.P., Moshiri, B., Araabi, B.N., Sadeghi, M.: Learning by abstraction:
hierarchical classification model using evidential theoretic approach and Bayesian
ensemble model. Neurocomputing 130, 73–82 (2014)
24. Nguyen, V.L., Destercke, S., Masson, M.H., Hüllermeier, E.: Reliable multi-class
classification based on pairwise epistemic and aleatoric uncertainty. In: Interna-
tional Joint Conference on Artificial Intelligence (2018)
25. Senge, R., et al.: Reliable classification: learning classifiers that distinguish aleatoric
and epistemic uncertainty. Inf. Sci. 255, 16–29 (2014)
26. Shafer, G.: A Mathematical Theory of Evidence, vol. 42. Princeton University
Press, Princeton (1976)
27. Shafer, G.: Constructive probability. Synthese 48(1), 1–60 (1981)
28. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res.
9(Mar), 371–421 (2008)
29. Smets, P.: Non-Standard Logics for Automated Reasoning. Academic Press,
London (1988)
30. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66(2), 191–234
(1994)
31. Soullard, Y., Destercke, S., Thouvenin, I.: Co-training with credal models. In:
Schwenker, F., Abbas, H.M., El Gayar, N., Trentin, E. (eds.) ANNPR 2016. LNCS
(LNAI), vol. 9896, pp. 92–104. Springer, Cham (2016). https://doi.org/10.1007/
978-3-319-46182-3 8
32. Sutton-Charani, N., Imoussaten, A., Harispe, S., Montmain, J.: Evidential bagging:
combining heterogeneous classifiers in the belief functions framework. In: Medina,
J., et al. (eds.) IPMU 2018. CCIS, vol. 853, pp. 297–309. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-91473-2 26
33. Walley, P.: Towards a unified theory of imprecise probability. Int. J. Approximate
Reasoning 24(2–3), 125–148 (2000)
34. Xiong, H., Li, M., Jiang, T., Zhao, S.: Classification algorithm based on NB for
the class overlapping problem. Appl. Math 7(2L), 409–415 (2013)
35. Zadeh, L.A.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst. 1(1),
3–28 (1978)
36. Zaffalon, M.: The naive credal classifier. J. Stat. Plan. Infer. 105(1), 5–21 (2002)
37. Zhang, J., Subasingha, S., Premaratne, K., Shyu, M.L., Kubat, M., Hewawasam,
K.: A novel belief theoretic association rule mining based classifier for handling
class label ambiguities. In: Proceedings of the Workshop on Foundations of Data
Mining (FDM 2004), International Conference on Data Mining (ICDM 2004) (2004)
38. Zheng, Y., Bai, J., Xu, J., Li, X., Zhang, Y.: A discrimination model in waste
plastics sorting using NIR hyperspectral imaging system. Waste Manage. 72, 87–98
(2018)
39. Zouhal, L.M., Denoeux, T.: An evidence-theoretic k-NN rule with parameter opti-
mization. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 28(2), 263–271
(1998)
An Analogical Interpolation Method
for Enlarging a Training Dataset

Myriam Bounhas1,2(B) and Henri Prade3


1 Emirates College of Technology, Abu Dhabi, UAE
myriam [email protected]
2 LARODEC Lab., ISG de Tunis, Tunis, Tunisia
3 IRIT, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse cedex 09, France
[email protected]

Abstract. In classification problems, it happens that the training set


remains scarce. Given a data set, described in terms of discrete, ordered
attribute values, we propose an interpolation-based approach in order
to predict new examples useful for enlarging the original data set. The
proposed approach relies on the use of continuous analogical proportions
that are statements of the form “a is to x as x is to c”. The prediction
is made on the basis of pairs of examples (a, c) present in the data set,
for which one can find a value for x for each attribute value as well as
for the corresponding class label of the example thus created. The first
option that we consider is to select x as the midpoint between a and c,
attribute by attribute. To extend the search space, we may also choose
x as any randomly selected value between the values of a and c. We first
propose a basic algorithm implementing these two interpolation defini-
tions, then we extend it to two improved algorithms. In the former, we
only consider the nearest neighbor pairs (a, c) to x for prediction, while,
in the latter, we further restrict the search to those pairs (a, c) having
the same class label. The experimental results, for classical ML classifiers
applied to the enlarged data sets built by the proposed algorithms, show
the effectiveness of analogical interpolation methods for enlarging data
sets.

1 Introduction
Analogical proportions are statements of the form “a is to b as c is to d”. In the
Nicomachean Ethics, Aristotle makes an explicit parallel between such statements
and geometric proportions of the form “a/b = c/d”, where a, b, c, d are numbers.
They also parallel arithmetic proportions, or difference proportions, which
are of the form “a − b = c − d”. The logical modeling of an analogical proportion
as a quaternary connective between four Boolean items appears to be a logical
counterpart of such numerical proportions [15]. It has been extended to items
described by vectors of Boolean, nominal or numerical values [2].
A particular case of such statements, named continuous analogical propor-
tions, is obtained when the two central components are equal, namely they are
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 136–152, 2019.
https://doi.org/10.1007/978-3-030-35514-2_11

statements of the form “a is to b as b is to c”. In the case of numerical proportions,
if we assume that b is unknown, it can be expressed in terms of a and c as
b = √(a · c) in the geometric case, and as b = (a + c)/2 in the arithmetic case.
Note that similar inequalities hold in both cases: min(a, c) ≤ √(a · c) ≤ max(a, c)
and min(a, c) ≤ (a + c)/2 ≤ max(a, c). This means that the continuous analogical
proportion induces a kind of interpolation between a and c in the numerical case
by involving an intermediary value that can be obtained from a and c.
General analogical proportions, when d is unknown, provide an extrapolation
mechanism, which with numbers yields d = (b · c)/a and d = b + c − a in the
geometric and arithmetic cases respectively. We recognize the well-known Rule
of Three in the first expression. Analogical proportion-based inference [2]
offers a similar extrapolation device relying on the parallel between (a, b) and
(c, d) stated by “a is to b as c is to d”.
The analogical proportions-based extrapolation has been successfully applied
to classification problems. It may be used either directly as a new classification
paradigm [2,12], or as a way of completing a training set on which classical
classification methods are applied once this set has been completed [1,4]. This
paper investigates the effectiveness of the simpler option of using only continuous
analogical proportions that involve pairs instead of triples of items, in order to
enlarge a training set.
The paper is organized as follows. Section 2 provides a short background on
analogical proportions and more particularly on continuous ones. Then Sect. 3
surveys related work on analogical interpolation or extrapolation. Section 4
presents different variants of algorithms for completing a training set based on
the idea of continuous analogical proportions. Section 5 reports the results of the
use of different classical classification techniques on the corresponding enlarged
training sets for various benchmarks.

2 Background: Continuous Analogical Proportion

The statement “a is to b as c is to d”, here denoted a : b :: c : d, expresses that


“a differs from b as c differs from d, and b differs from a as d differs from c”. The
logical counterpart of the latter statement, where a, b, c, d are Boolean variables,
is given by:

a : b :: c : d = (¬a ∧ b ≡ ¬c ∧ d) ∧ (¬b ∧ a ≡ ¬d ∧ c)

See [13,16] for justifications. This expression is true for only 6 patterns of values
for abcd, namely {0000, 0011, 0101, 1111, 1100, 1010}. This extends to nominal
values where a : b :: c : d holds true if and only if abcd is one of the following
patterns ssss, stst, or sstt, where s and t are two possible distinct values of
items a, b, c and d.
Regarding continuous analogical proportions, it can be easily checked that
the unique solutions of equations 1 : x :: x : 1 and 0 : x :: x : 0 are respectively
x = 1 and x = 0, while 1 : x :: x : 0 or 0 : x :: x : 1 have no solution in the

Boolean case. This somewhat trivializes continuous analogical proportions in the


Boolean case. The situation for nominal values is the same.
The case of numerical values is richer. a, b, c, d are now supposed to be
normalized values in the real interval [0, 1]. The reader is referred to [6] for a
general discussion of multiple-valued logic extensions of analogical proportions.
They can be associated with the following expression:


a : b :: c : d =
    1 − |(a − b) − (c − d)|,    if a ≥ b and c ≥ d, or a ≤ b and c ≤ d,
    1 − max(|a − b|, |c − d|),  if a ≤ b and c ≥ d, or a ≥ b and c ≤ d.    (1)
It coincides with a : b :: c : d on {0, 1}. As can be seen, a : b :: c : d is


equal to 1 if and only if (a − b) = (c − d). For instance, 0.2 : 0.5 :: 0.6 : 0.9,
or 0.2 : 0.5 :: 0.2 : 0.5 holds true. Because |a − b| = |(1 − a) − (1 − b)|, it is
easy to check that the code independence property: a : b :: c : d = (1 − a) :
(1 − b) :: (1 − c) : (1 − d) holds (0 and 1 play symmetric roles, and it is the same
to encode an attribute positively or negatively).
Then the corresponding expression for continuous analogical proportions
is [16]:


a : b :: b : c =
    1 − |a + c − 2b|,           if a ≥ b and b ≥ c, or a ≤ b and b ≤ c,
    1 − max(|a − b|, |b − c|),  if a ≤ b and b ≥ c, or a ≥ b and b ≤ c.    (2)

As can be seen, a : b :: b : c = 1 if and only if b = (a + c)/2 (which includes
the case a = b = c). The proportions 0 : 1/2 :: 1/2 : 1 or 0.3 : 0.6 :: 0.6 : 0.9
are examples of continuous analogical proportions. Moreover, 1 : 3 :: 3 : 5 is an
example of a continuous analogical proportion between nominal ordered grades.
Thus this extension captures the idea of betweenness implicit in statements of
the form “a is to b as b is to c”. Note that we have 0 : 1 :: 1 : 0 = 0 and
1 : 0 :: 0 : 1 = 0, as expected.
Analogical proportions extend to vectors in a component-wise manner. Let
a = (a1 , . . . , am ), where each ai belongs to {0, 1} (Boolean case), or to a finite
set with more than 2 elements (nominal case), or to [0, 1] (numerical case).
b, c, d are defined similarly. Then a : b :: c : d has a truth value which is just
min_{i=1,...,m} ai : bi :: ci : di .
In this paper, we deal with classification, so each vector a in a training set
is associated with its class cl(a). Thus, saying that the continuous analogical
proportion a : x :: x : c holds true amounts to saying:

a : x :: x : c = 1 iff aj : xj :: xj : cj = 1 for each attribute j,
and cl(a) : cl(x) :: cl(x) : cl(c) = 1.    (3)
Moreover, since continuous analogical proportions are trivial for a Boolean or a
nominal variable, we shall also use in this paper a more liberal extension of
betweenness for the vectorial case [10]. Namely, we shall say that x is between
a and c, defined as:

between(a, x, c) = 1 iff aj ≤ xj ≤ cj or cj ≤ xj ≤ aj for each attribute j. (4)

Then we can define the set Between(a, c) of vectors between two vectors a and
c. For instance, we have Between(01000, 11010) = {01000, 11000, 01010, 11010}.
Note that in case of Boolean values, the betweenness condition can also be
written as ∀i = 1, · · · , m, (ai ∧ ci → xi ) ∧ (xi → ai ∨ ci ) = 1.
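For numerical values, the degree of Eq. (2) and the vectorial betweenness of Eq. (4) are direct to compute. A minimal illustrative sketch in Python (function names are ours, not from the paper):

```python
def cont_ap_degree(a, b, c):
    """Truth degree of the continuous analogical proportion a : b :: b : c,
    following Eq. (2), for a, b, c in [0, 1]."""
    if (a >= b >= c) or (a <= b <= c):
        return 1 - abs(a + c - 2 * b)
    return 1 - max(abs(a - b), abs(b - c))

def between(a, x, c):
    """Vectorial betweenness of Eq. (4): x_j lies between a_j and c_j
    for every attribute j."""
    return all(min(aj, cj) <= xj <= max(aj, cj)
               for aj, xj, cj in zip(a, x, c))
```

On the examples above, cont_ap_degree(0, 0.5, 1) is 1, cont_ap_degree(0, 1, 0) is 0, and the vector 11000 is found between 01000 and 11010.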

3 Related Work
The idea of generating, or completing, a third example from two examples can be
encountered in different settings. An option, quite different from interpolation, is
the “feature knock out” method [23], where a third example is built by modifying
a randomly chosen feature of the first example with that of the second one. A
somewhat related idea can be found in a recent proposal [3] which introduces
a measure of oddness with respect to a class that is computed on the basis of
pairs made of two nearest neighbors in the same class; this amounts to replacing
the two neighbors by a fictitious representative of the class.
Reasoning with a system of fuzzy if-then rules provides an interpolation
mechanism [14], which, from these rules and an input “in-between” their con-
dition parts, yields a new conclusion “in-between” their conclusion parts, by
taking advantage of membership functions that can be seen as defining fuzzy
“neighborhoods”.
Moreover, several approaches based on the use of interpolation and analog-
ical proportions have been developed in the past decade. In [17], the problem
considered is to complete a set of parallel if-then rules, represented by a set of
condition variables associated to a conclusion variable. The values of the vari-
ables are assumed to belong to finite sets of ordered labels. The basic idea is
to apply analogical proportion inference in order to induce missing rules from
an initial set of rules, when an analogical proportion holds between the variable
labels of several parallel rules. Although this approach may seem close to the
analogical interpolation-based approach proposed in this paper, our goal is not to
predict just the conclusion part of an incomplete rule, but rather a whole exam-
ple including its attribute-based description and its class. Moreover, we restrict
our study to the use of pairs of examples for this prediction, while in [17] the
authors use both pairs and triples of rules for completing rules. An extended
version of the above-mentioned work has been presented in [22], where the authors
also propose a more cautious method that makes explicit the basic assumptions
under which rule conclusions are produced from analogical proportions. Along
the same line, see also [21] on interpolation between default rules.
Let us also mention the general approach proposed by Schockaert and Prade
[20] to interpolative and extrapolative reasoning from incomplete generic knowl-
edge represented by sets of symbolic rules, handled in a purely qualitative man-
ner, where labels are represented in conceptual spaces. This work is an extended

version of [19] in which only interpolative inference is considered. The same


authors present an illustrative case study in [18] in the music domain. In the
context of natural language modeling, Derrac and Schockaert [5] have proposed
a data-driven approach that exploits betweenness and a fortiori inference to
derive semantic relations within conceptual spaces.
Besides, some previous works have considered, discussed and experimented
the idea of an analogical proportion-based enlargement of a training set, based
on triples of examples. In [1], the authors proposed an approach to generate
synthetic data to tune a handwritten character classifier. Couceiro et al. [4]
presented a way to extend a Boolean sample set for classification using the
notion of “analogy preserving” functions that generate examples on the basis of
triples of examples in the training set. The authors only tested their approach
on Boolean data.
In a more recent work, Lieber et al. [10] have extended the paradigm of classi-
cal Case-Based Reasoning to link the current case to either pairs of known cases
by performing a restricted form of interpolation, or to triples of known cases by
exploiting extrapolation, taking advantage of betweenness and analogical pro-
portion relations.
Lastly, in the context of deep learning, Goodfellow et al. [7] introduced
generative adversarial networks (GANs) as a class of machine learning systems.
Given a training set, two neural networks, competing with each other in a game,
are trained in order to generate new data with the same statistics as the
training set. More recently, Inoue [9] presented a data augmentation technique
for image classification that mixes two randomly picked images to train a classifier.

4 Analogical Interpolation-Based Predictor (AIP)

Analogical proportions have been recently applied to classification problems and


have shown their efficiency for classifying a variety of datasets [2]. In this paper,
we aim to investigate if continuous analogical proportions could be useful for a
prediction purpose, namely enlarging a training set with newly created examples, and if
standard classification methods applied to this enlarged set can compete with
the direct application of analogical proportions-based inference for classification.
As said before, the basic idea of the paper is to apply an interpolation method
for predicting new examples not present in the original data set which is just
enlarged.
In the following, we describe the basic principle of our predicting approach.

4.1 Basic Procedure



Consider a set E of n classified examples, i.e., E = {(x1 , y 1 ), ..., (xi , y i ), ...,
(xn , y n )}, such that the class label y i = cl(xi ) is known for each i ∈ {1, ..., n}.
The goal is to predict a new set of examples S = {(xk , y k ) ∉ E} by interpolating
examples from the set E. The new set S will serve to enlarge E.

The basic idea is to find pairs of examples (a, c) ∈ E 2 with known labels such
that the analogical proportion (3) is solvable attribute by attribute i.e., there
exists x such that aj : xj :: xj : cj = 1 for each attribute j = 1, ..., m, and the
class equation has cl(x) as a solution, i.e., cl(a) : cl(x) :: cl(x) : cl(c) = 1.
As mentioned before in Sect. 2, the solution for the previous equation aj :
xj :: xj : cj = 1 in the numerical case is just the midpoint xj = (aj + cj )/2 for
each attribute j = 1, ..., m. We are interested in the case of ordered nominal val-
ues in this paper. Moreover, we assume that the distances between any two suc-
cessive values in such an ordered set of values are the same. Let V = {v1 , · · · , vk }
be an ordered set of nominal values, then, vi will be regarded as the midpoint of
vi−j and vi+j with j ≥ 1, provided that both vi−j and vi+j exist. For instance,
if V = {1, · · · , 5}, the analogical proportions 1 : 3 :: 3 : 5 or 2 : 3 :: 3 : 4 hold,
while 2 : x :: x : 5 = 1 has no solution. So it is clear that some pairs (a, c) will
not lead to any solution since we restrict the search space to the pairs for which
the midpoint (attribute by attribute) exists.
This condition may be too restrictive, especially for datasets with a high number
of attributes, which may reduce the set of predicted examples. In case of success,
the predicted example x = {x1 , ..., xj , ..., xm } will be assigned the predicted
class label cl(x) and saved in a candidate set.
Since different voting pairs may predict the same example x more than once
(x may be the midpoint of more than one pair (a, c)), a candidate example may
have different class labels. One then has to perform a vote on class labels for
each candidate example classified differently in the candidate set. This leads to
the final predicted set of examples, where each example is classified uniquely.
This process can be described by the following procedure:

1. Find solvable pairs (a, c) such that Eq. 3 has a unique non-null solution x.
2. In case of ties (an example x is predicted with different class labels), apply
   voting on all its predicted class labels and assign to x the winning label.
3. Add x to the set of predicted examples (together with cl(x)).
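This procedure can be sketched as follows, assuming ordered nominal values encoded as integer ranks and nominal class labels, for which the class equation is solvable only when cl(a) = cl(c); all names are illustrative, not from the paper:

```python
from collections import Counter, defaultdict
from itertools import combinations

def midpoint(a, c):
    """Attribute-wise midpoint of two ordered-nominal vectors (integer
    ranks), or None if some attribute has no exact midpoint on the scale."""
    x = []
    for aj, cj in zip(a, c):
        if (aj + cj) % 2:      # no exact midpoint, pair not solvable
            return None
        x.append((aj + cj) // 2)
    return tuple(x)

def aip_std(examples):
    """Sketch of the basic predictor: generate candidates as midpoints of
    solvable pairs, then vote on the class of each candidate.
    `examples` is a list of (vector, label) pairs."""
    votes = defaultdict(Counter)
    for (a, ya), (c, yc) in combinations(examples, 2):
        if ya != yc:           # nominal class equation unsolvable
            continue
        x = midpoint(a, c)
        if x is not None:
            votes[x][ya] += 1
    known = {tuple(v) for v, _ in examples}
    # keep only genuinely new examples, with their majority label
    return {x: cnt.most_common(1)[0][0]
            for x, cnt in votes.items() if x not in known}
```

For instance, the pair ((1, 1), (3, 1)) with a common label yields the new example (2, 1) with that label, while midpoint((2,), (5,)) fails, as 2 : x :: x : 5 = 1 has no solution on an ordered scale.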

In the next section, we first present a basic algorithm applying the process
described above, then we propose two options that may help to improve the
search space for the voting pairs.

4.2 Algorithms
The simplest way is to systematically consider all pairs (a, c) ∈ E 2 , for which
Eq. 3 is solvable, as candidate pairs for prediction. Algorithm 1 implements a
basic Analogical Interpolation-based Predictor, denoted AIPstd , without applying
any filter on the voting pairs.
Considering all pairs (a, c) for prediction may seem unreasonable especially
when the domain of attribute values is large since this may blur prediction
results. A first improvement of Algorithm 1 is to restrict the search for pairs to
those that are among the nearest neighbors (in terms of Hamming distance) to
the example to be predicted.

Algorithm 1. Analogical Interpolation-based Predictor (AIPstd )

Input: A set E of classified instances
CandidatesSet = ∅
S = ∅
for each pair (a, c) in E^2 do
    if cl(a) : cl(x) :: cl(x) : cl(c) = 1 has solution l then
        if a : x :: x : c = 1 has solution b then
            cl(b) = l
            CandidatesSet.add(b)
        end if
    end if
end for
S = VoteOnClasses(CandidatesSet)
Comp(E) = E + S
return (Comp(E))

Let us consider two different pairs (a, c) and (d, e) ∈ E^2. We assume that
a : x :: x : c = 1 produces as solution an example b, and d : x :: x : e = 1
produces another example b′ ≠ b. If b′ is closer to (d, e) than b is to (a, c)
in terms of Hamming distance, it is more reasonable to consider only the pair
(d, e) for prediction. This means that example b′ will be predicted while b will
be rejected. We denote AIPNN this second, improved Algorithm 2 in the following.
Algorithm 3 (that we denote AIPNN,SC ) is exactly the same as Algorithm 2 in
all respects, except that here we only look for pairs (a, c) belonging to the same
class. Note that the two algorithms only differ for non-binary classification
problems, since s : x :: x : t = 1 has no solution in {0, 1} for s ≠ t.
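The nearest-neighbor filter shared by Algorithms 2 and 3 can be sketched as follows; each candidate records the proposed example b, its label, and the generating pair (a, c) (names are illustrative):

```python
def hamming(u, v):
    """Number of attributes on which two vectors differ."""
    return sum(uj != vj for uj, vj in zip(u, v))

def keep_nearest(candidates):
    """Keep only candidates whose distance max(H(b, a), H(b, c)) to their
    generating pair is globally minimal. Each candidate is a tuple
    (b, label, a, c)."""
    scored = [(max(hamming(b, a), hamming(b, c)), b, lab)
              for b, lab, a, c in candidates]
    best = min(s for s, _, _ in scored)
    return [(b, lab) for s, b, lab in scored if s == best]
```

Only the candidates whose worst Hamming distance to their generating pair is globally minimal survive the filter.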

4.3 Another Option

As can be seen in the next section, searching for the best pairs (described in
Algorithms 2 and 3) limits the number of accepted voting pairs. Moreover, there
is a second constraint to be satisfied: limiting the solutions of Eq. 3 to the
values of x that are the midpoint of a and c, which is hard to satisfy in the
ordered nominal setting. To relax this last constraint, we may consider using
the “betweenness” definition given in Eq. 4. In this definition, the equation
between(a, x, c) = 1 has, as a solution, any x such that x is between a and c
for each attribute j ∈ 1, ..., m. This last option is implemented by the algorithm
denoted AIPBtw , which is exactly the same as Algorithm 3 except that we use
the definition (4) to solve the analogical interpolation.
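The set Between(a, c) used by AIPBtw can be enumerated attribute by attribute; a small sketch with integer-encoded ordered values (the function name is ours):

```python
from itertools import product

def between_set(a, c):
    """All vectors x with a_j <= x_j <= c_j or c_j <= x_j <= a_j for
    every attribute j (Eq. 4), attributes encoded as integer ranks."""
    ranges = [range(min(aj, cj), max(aj, cj) + 1) for aj, cj in zip(a, c)]
    return [tuple(x) for x in product(*ranges)]
```

On the example of Sect. 2, between_set((0, 1, 0, 0, 0), (1, 1, 0, 1, 0)) returns the four vectors 01000, 01010, 11000, 11010.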

Algorithm 2. Analogical Interpolation-based Predictor using Nearest Neighbors pairs for prediction (AIPNN )

Input: A set E of classified instances
CandidatesSet = ∅
S = ∅
MinHD = NbrAttribute
for each pair (a, c) in E^2 do
    if cl(a) : cl(x) :: cl(x) : cl(c) = 1 has solution l then
        if a : x :: x : c = 1 has solution b then
            cl(b) = l
            HD = Max(HammingDistance(b, a), HammingDistance(b, c))
            if HD < MinHD then
                MinHD = HD
                CandidatesSet.clear()
                CandidatesSet.add(b)
            else if HD = MinHD then
                CandidatesSet.add(b)
            end if
        end if
    end if
end for
S = VoteOnClasses(CandidatesSet)
Comp(E) = E + S
return (Comp(E))

5 Experimentations and Discussion

In this section, we aim to evaluate the efficiency of the proposed algorithms for
predicting new examples. For this purpose, we first run different standard ML
classifiers on the original dataset, then we apply each AI-Predictor to generate a
new set of predicted examples that is used to enlarge the original data set. This
leads us to four different enlarged datasets, one for each proposed algorithm.
Finally, we re-evaluate again ML classifiers on each of these completed datasets.
For both original and enlarged datasets, we apply the testing protocol presented
in the next sub-section.
In this experimentation, we tested with the following standard ML classifiers:

• IBk: a k-NN classifier, we use the Manhattan distance and we tune the
classifier on different values of the parameter k = 1, 2, ..., 11.
• C4.5: generating a pruned or unpruned C4.5 decision tree. We tune the
classifier with different confidence factors used for pruning C = 0.1, 0.2, ..., 0.5.
• JRip: propositional rule learner, Repeated Incremental Pruning to Produce
Error Reduction (RIPPER). We tune the classifier for different values of
optimization runs O = 2, 4, ...10 and we apply pruning.

Algorithm 3. Analogical Interpolation-based Predictor using Nearest Neighbors pairs in the same class for prediction (AIPNN,SC )

Input: A set E of classified instances
CandidatesSet = ∅
S = ∅
MinHD = NbrAttribute
for each pair (a, c) in E^2 do
    if cl(a) = cl(c) then
        if a : x :: x : c = 1 has solution b then
            cl(b) = cl(a)    // or cl(c)
            HD = Max(HammingDistance(b, a), HammingDistance(b, c))
            if HD < MinHD then
                MinHD = HD
                CandidatesSet.clear()
                CandidatesSet.add(b)
            else if HD = MinHD then
                CandidatesSet.add(b)
            end if
        end if
    end if
end for
S = VoteOnClasses(CandidatesSet)
Comp(E) = E + S
return (Comp(E))

5.1 Datasets for Experiments


The experimental study is based on several datasets taken from the U.C.I.
machine learning repository [11]. A brief description of these data sets is given
in Table 1.
To apply the analogical interpolation, we have chosen to deal only with
ordered nominal datasets in this study (the extension to the numerical case
is the topic of a future work). Table 1 includes 10 datasets with ordered nom-
inal or Boolean attribute values. In terms of classes, we deal with a maximum
number of 5 classes.
– Balance, Car, Hayes-Roth and Nursery are multiple classes datasets.
– Monk1, Monk2, Monk3, Breast Cancer, Voting and W. B. Cancer datasets
are binary class problems. Monk3 has noise added (in the sample set only).
The Voting dataset contains only binary attributes and has missing attribute
values. Since a missing value in this dataset simply means that the value is
neither “yes” nor “no”, we replace each missing value by a third value other than
0 and 1. These data sets are described in Table 1.

5.2 Testing Protocol


To test ML classifiers, we apply a standard 10 fold cross-validation technique.
As usual, the final accuracy is obtained by averaging the 10 different accuracies

(computed as the ratio of the number of correct predictions to the total number
of test examples) for each fold. However, each ML classifier requires a parameter
to be tuned before performing this cross-validation.

Table 1. Description of datasets

Datasets Instances Nominal Att. Binary Att. Classes


Balance 625 4 0 3
Car 743 6 0 4
Monk1 432 4 2 2
Monk2 432 4 2 2
Monk3 432 4 2 2
Breast Cancer 286 6 3 2
Voting 435 0 16 2
Hayes-Roth 132 5 0 3
W. B. Cancer 699 9 0 2
Nursery 1102 8 0 5

In order to do that, we randomly choose a fold (as recommended by [8]) and
keep only the corresponding training set (which represents 90% of the full
dataset). On this training set, we again perform a 10-fold cross-validation with
diverse values of the parameters. We then select the parameter values providing
the best accuracy. These tuned parameters are then used to perform the initial
cross-validation. As expected, these tuned parameters change with the target
dataset. To be sure that our results are stable enough, we run each algorithm
(with the previous procedure) 10 times so we have 10 different parameter opti-
mizations. The displayed parameter p is the average value over the 10 different
values (one for each run). The results shown in Table 2 are the average values
obtained from 10 rounds of this complete process.
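The protocol above can be sketched as follows. A simple k-NN classifier on integer-coded attributes stands in for Weka's IBk, C4.5 and JRIP, and the candidate parameter values are invented for the example; the paper additionally repeats the whole process 10 times and averages.

```python
import random
from collections import Counter

def knn_accuracy(train, test, X, y, k):
    """Accuracy of a k-NN stand-in classifier (Manhattan distance on
    integer-coded ordered nominal attributes)."""
    correct = 0
    for i in test:
        nearest = sorted(train,
                         key=lambda j: sum(abs(a - b) for a, b in zip(X[i], X[j])))[:k]
        votes = Counter(y[j] for j in nearest)
        correct += votes.most_common(1)[0][0] == y[i]
    return correct / len(test)

def cv_accuracy(indices, X, y, k_param, n_folds=10):
    """Average accuracy over an n-fold cross-validation restricted to `indices`."""
    folds = [indices[f::n_folds] for f in range(n_folds)]
    accs = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        accs.append(knn_accuracy(train, fold, X, y, k_param))
    return sum(accs) / len(accs)

def tune_and_evaluate(X, y, candidate_ks=(1, 3, 5), seed=0):
    """Protocol sketch: tune the parameter on the training part (90%) of one
    randomly chosen fold via an inner cross-validation, then rerun the outer
    10-fold cross-validation with the selected value."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::10] for f in range(10)]
    inner_train = [i for fold in folds[1:] for i in fold]  # one fold held out
    best_k = max(candidate_ks, key=lambda k: cv_accuracy(inner_train, X, y, k))
    return best_k, cv_accuracy(idx, X, y, best_k)
```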

5.3 Experimental Results


In the following, we first provide a comparative study of the overall accuracies
of ML classifiers obtained with the original and enlarged datasets. This study
aims to check whether the examples predicted by the AIP are of good quality
(namely, labeled with the suitable class): in that case, the accuracy of ML
classifiers should improve when they are applied to the enlarged datasets. We
then report the main characteristics of these predicted datasets. Finally, we
compare ML classification results on the enlarged datasets to those obtained by
directly applying Analogy-based Classification [2] to the original datasets; in
this last study, we examine whether ML classifiers applied to enlarged datasets
can perform similarly to Analogy-based Classification while maintaining a
reduced complexity.
146 M. Bounhas and H. Prade

Results of ML-Classifiers. Accuracy results for IBk, C4.5 and JRIP are
obtained by applying the free Weka implementations of these classifiers to the
enlarged datasets produced by the AI-Predictors. To run IBk, C4.5 and JRIP,
we first optimize the relevant parameter of each classifier, using the
CVParameterSelection meta-class provided by Weka, with a cross-validation
applied to the training set only. This enables us to select the best parameter
value for each dataset; we then train and test the classifier using this
selected value.
Table 2 provides the classification results of the ML classifiers, obtained with
a 10-fold cross-validation and for the best/optimized value of the tuned
parameter (denoted p in the table). The results in Table 2 show that:

Table 2. Results for ML classifiers obtained with the enlarged datasets

Datasets KNN C4.5 JRIP


Accuracy p Accuracy p Accuracy p
Balance AIP-NN,SC 85.7 ± 2.13 1 74.15 ± 2.42 0.5 76.05 ± 2.85 9
AIP-NN 85.31 ± 3.24 1 73.73 ± 4.12 0.5 75.09 ± 3.23 6
AIP-Std 78.16 ± 1.15 3 65.92 ± 2.73 0.5 68.45 ± 3.73 6
AIP-Btw 83.04 ± 3.42 3 75.44 ± 3.89 0.5 75.21 ± 4.64 7
Orig. 84.05 ± 2.6 11 63.79 ± 4.33 0.3 72.74 ± 3.48 6
Car AIP-NN,SC 91.4 ± 1.84 1 92.78 ± 1.28 0.4 88.6 ± 2.82 8
AIP-NN 91.5 ± 1.95 1 93.14 ± 1.95 0.5 89.13 ± 2.55 8
AIP-Std 91.51 ± 1.91 3 92.26 ± 1.85 0.3 89.09 ± 1.93 6
AIP-Btw 86.74 ± 2.71 4 88.74 ± 1.99 0.4 85.61 ± 2.38 8
Orig. 92.38 ± 2.51 1 90.84 ± 3.61 0.5 86.58 ± 3.67 8
Monk1 AIP-NN,SC 94.58 ± 2.7 5 94.11 ± 2.88 0.2 93.75 ± 2.48 2
AIP-NN 94.82 ± 2.37 3 94.53 ± 2.35 0.1 93.62 ± 1.9 2
AIP-Std 87.07 ± 4.48 3 87.35 ± 2.49 0.1 83.21 ± 4.34 6
AIP-Btw 85.34 ± 3.91 3 88.15 ± 4.78 0.3 89.46 ± 3.66 4
Orig. 98.37 ± 2.78 2 99.36 ± 0.64 0.4 90.99 ± 13.15 2
Monk2 AIP-NN,SC 82.41 ± 4.77 1 72.44 ± 0.19 0.1 71.91 ± 3.32 5
AIP-NN 82.49 ± 7.56 1 72.44 ± 0.19 0.1 71.87 ± 3.8 3
AIP-Std 76.12 ± 4.28 3 77.03 ± 0.0 0.1 76.6 ± 0.43 4
AIP-Btw 80.86 ± 0.79 3 80.79 ± 0.78 0.1 80.56 ± 0.82 3
Orig. 65.29 ± 1.74 11 67.13 ± 0.61 0.1 64.64 ± 3.69 4
Monk3 AIP-NN,SC 98.38 ± 1.31 3 98.41 ± 1.31 0.1 98.24 ± 1.49 2
AIP-NN 98.38 ± 1.41 3 98.41 ± 1.41 0.1 98.27 ± 1.42 2
AIP-Std 92.91 ± 2.47 3 93.58 ± 3.09 0.1 92.09 ± 2.63 4
AIP-Btw 97.75 ± 1.76 3 97.71 ± 1.76 0.1 97.87 ± 1.79 2
Orig. 99.14 ± 1.49 1 99.82 ± 0.18 0.2 98.95 ± 1.48 2
Breast Cancer AIP-NN,SC 75.57 ± 8.31 4 74.01 ± 7.29 0.2 71.9 ± 8.6 4
AIP-NN 75.59 ± 4.95 5 73.68 ± 6.85 0.2 71.0 ± 7.49 5
AIP-Std 83.0 ± 3.19 6 82.47 ± 3.93 0.1 80.3 ± 7.01 3
AIP-Btw 75.86 ± 5.27 4 75.94 ± 5.99 0.2 72.61 ± 5.84 4
Orig. 72.81 ± 7.65 9 71.58 ± 6.55 0.2 70.11 ± 8.59 2
Voting AIP-NN,SC 93.32 ± 3.58 4 95.65 ± 2.67 0.2 95.62 ± 2.85 3
AIP-NN 93.05 ± 3.17 3 95.79 ± 3.59 0.3 95.67 ± 3.07 3
AIP-Std 93.89 ± 2.31 2 96.12 ± 2.02 0.3 96.1 ± 2.04 3
AIP-Btw 93.22 ± 3.84 2 95.45 ± 2.37 0.2 95.73 ± 2.13 2
Orig. 92.5 ± 3.59 4 96.38 ± 2.63 0.2 95.84 ± 2.39 4
Hayes-Roth AIP-NN,SC 74.62 ± 8.84 1 74.4 ± 9.63 0.2 84.79 ± 7.65 4
AIP-NN 73.91 ± 8.0 1 74.13 ± 7.65 0.2 85.12 ± 6.58 5
AIP-Std 60.45 ± 11.59 3 70.62 ± 9.3 0.4 78.78 ± 9.67 4
AIP-Btw 69.87 ± 7.77 1 80.43 ± 12.53 0.1 88.52 ± 8.8 2
Orig. 61.41 ± 10.31 3 68.2 ± 6.66 0.2 83.26 ± 9.04 4
W. B. Cancer AIP-NN,SC 95.92 ± 1.69 1 94.38 ± 3.38 0.4 94.57 ± 2.15 5
AIP-NN 96.12 ± 2.47 1 94.05 ± 2.82 0.3 94.5 ± 2.31 4
AIP-Std 96.82 ± 1.22 3 97.37 ± 1.23 0.5 96.56 ± 2.19 5
AIP-Btw 95.99 ± 1.17 2 94.43 ± 1.49 0.4 94.44 ± 2.16 5
Orig. 96.7 ± 1.73 3 94.79 ± 3.19 0.2 95.87 ± 2.9 4
Nursery AIP-NN,SC 98.23 ± 0.96 1 98.69 ± 0.56 0.4 97.78 ± 1.12 6
AIP-NN 98.25 ± 0.78 1 98.74 ± 0.64 0.5 97.83 ± 1.25 5
AIP-Std 97.73 ± 0.88 1 98.0 ± 0.96 0.5 97.74 ± 0.99 5
AIP-Btw 95.9 ± 0.97 3 96.51 ± 1.34 0.4 95.78 ± 1.5 6
Orig. 97.45 ± 1.34 3 97.7 ± 1.36 0.5 95.58 ± 2.04 4
Average AIP-NN,SC 89.01 – 86.90 – 87.32 –
AIP-NN 88.94 – 86.86 – 87.21 –
AIP-Std 85.76 – 86.07 – 85.89 –
AIP-Btw 86.45 – 87.35 – 87.57 –
Orig. 86,01 – 84,96 – 85,46 –

– Accuracy results improve when ML classifiers are applied to the newly
predicted data instead of the original data, for all datasets except Monk1
and Monk3. The highest improvements are observed with the IBk classifier,
on Monk2 (17%), Hayes-Roth (13%) and Breast Cancer (11%).
– Regarding the two artificial datasets Monk1 and Monk3, it is known that,
in the original data, only two attributes among 6 are involved in defining
the class label of each example. We may think that using the midpoint value
for each attribute as well as for the class label, as done in the proposed
analogical interpolation, which treats all attributes equally, is not
compatible with this kind of classification.
– The good improvement observed on the Monk2 dataset supports this
intuition since, contrary to Monk1 and Monk3, in Monk2 all attributes are
involved in defining the class label.
– The standard Algorithm 1 outperforms the other algorithms on the Cancer and
Breast Cancer datasets. It is important to note that only these two datasets
include attributes with a large range of values (a maximum of 10 distinct
values for Cancer and 13 for Breast Cancer), and their number of attributes
is also high compared to the other datasets. We suspect that, when ordered
nominal attributes are represented on a large scale, using only
nearest-neighbor pairs for prediction is too restrictive and leads to an
overly local search for new examples.
– There is no particular algorithm that provides the best results for all datasets.
– We computed the average accuracy of each proposed algorithm and each ML
classifier over all datasets; the results are given at the end of Table 2. The
IBk classifier achieves its best accuracy when using the enlarged data built
by the AIP-NN,SC algorithm, while C4.5 and JRIP perform better when applied
to the datasets built by the AIP-Btw algorithm.
– Overall, the IBk classifier shows the highest classification accuracy over
all datasets.

In this first study, the improved results of ML classifiers applied to the
enlarged datasets show the ability of the proposed algorithms (especially
AIP-NN,SC and AIP-Btw) to predict examples that are labeled with the suitable
class.
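The pair-based midpoint generation underlying these algorithms can be sketched as follows. Attribute values are assumed integer-coded on their ordered scales; the nearest-neighbor pair filtering used by some AIP variants and the scoring details of the actual algorithms are omitted:

```python
def midpoint_example(a, b):
    """Continuous analogical proportion a:x::x:b — the interpolated example x
    exists when every attribute-wise midpoint falls on the ordered scale."""
    mids = []
    for u, v in zip(a, b):
        if (u + v) % 2 != 0:
            return None  # no exact midpoint on this attribute's scale
        mids.append((u + v) // 2)
    return tuple(mids)

def enlarge(examples):
    """Generate new labeled examples from same-class pairs: the attribute
    values and the class label of a new example are midpoints of those of
    the pair (under the same-class restriction, the class midpoint is just
    the common label). Quadratic in the number of examples."""
    new = set()
    items = list(examples.items())
    for i, (xa, ca) in enumerate(items):
        for xb, cb in items[i + 1:]:
            if ca != cb:  # same-class restriction
                continue
            mid = midpoint_example(xa, xb)
            if mid is not None and mid not in examples:
                new.add((mid, ca))
    return new
```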

Characteristics of the Predicted Datasets. To better understand the results
shown above, in this subsection we investigate the newly predicted datasets in
more detail. To this end, we compute, for each dataset, the number of predicted
examples and the proportion of these examples that are assigned the
correct/suitable class label. This proportion is computed on the basis of the
predicted examples that are compatible with the original set. For this new
experiment, we only consider examples predicted by the AIP-NN,SC algorithm
(and by AIP-Std for some datasets). These additional results are reported in
Table 3.
From these results, we can see that:

– In seven of the ten datasets, the proportion of predicted examples that are
successfully classified is 100%. This means that all predicted examples that
match the original set are assigned the correct class label and are thus
fully compatible with the original set (see for example Monk2, Breast Cancer,
Hayes-Roth and Nursery).
– Predicting accurate examples in these datasets may explain why ML
classifiers show a marked classification improvement when applied to the
enlarged datasets.
– Although the AIP-NN,SC algorithm succeeds in predicting accurate examples,
the number of predicted examples is very small for some datasets, such as
Breast Cancer, Voting and Cancer. This is due to the fact that this algorithm
restricts the search to nearest-neighbor pairs belonging to the same class.
It is important to note that these datasets contain a large number of
attributes, which makes the pair-filtering process more constraining.
– As can be seen in Table 3, the size of the predicted sets for these three
datasets increases considerably when applying the AIP-Std algorithm, which is
less constraining than AIP-NN,SC (520 examples instead of 46 are predicted
for the Cancer dataset). In Table 2, we also noticed that, only for these
three datasets, IBk performs considerably better when applied to the datasets
built by the standard algorithm AIP-Std (which produces larger sets). Clearly,
when the predicted set is very small, the enlarged dataset remains similar to
the original set, which is why the improvement of ML classifiers is less
visible on the datasets predicted by the AIP-NN,SC algorithm.
– Lastly, for some datasets such as Monk1 and Monk3, the proportion of
predicted examples that are compatible with the original set is low compared
to the other datasets. As explained before, in the original sets the
classification function involves only 2 of the 6 attributes, which seems
incompatible with continuous analogical interpolation, where each attribute
as well as the class label is taken as the midpoint of the corresponding
values of the pair used for prediction.

Table 3. Number of predicted examples, and proportion of predicted examples
that are compatible with the original set

Datasets Nbr. predicted Prop. of success


Balance 529 85.82
Car 630 93.44
Monk1 288 87.5
Monk2 221 100
Monk3 320 96.25
Breast Cancer (AIP-NN,SC) 14 100
Breast Cancer (AIP-Std) 152 83.78
Voting (AIP-NN,SC) 38 100
Voting (AIP-Std) 95 100
Hayes-Roth 27 100
Cancer (AIP-NN,SC) 46 100
Cancer (AIP-Std) 520 100
Nursery 883 99.89

Comparison with AP-Classifier [2]. Finally, we compare the ML classifier
results reported in Sect. 5.3 with those obtained by a direct application of
analogical proportions for classification purposes [2]. Note that in [2],
analogical proportion-based extrapolation was directly applied to define a new
classification paradigm, while in this paper we exploit analogical
proportion-based interpolation to enlarge the datasets on which classical ML
classifiers are applied. The classification accuracies of the analogical
proportion-based classifiers of [2] are given in Table 4 and compared to the
best result of each ML classifier applied to the enlarged datasets. The results
in Table 4 show that the AP-Classifier outperforms classic ML classifiers on
five datasets, especially on the three Monk datasets. However, enlarging the
datasets by analogical interpolation helped to reduce the gap between the
AP-Classifier and the other ML classifiers. On the other hand, ML classifiers
provide better accuracies on four other datasets (see for example the Breast
Cancer (resp. Hayes-Roth) dataset, for which IBk (resp. JRIP) is markedly
better than the AP-Classifier).

Table 4. Results for ML classifiers obtained with the enlarged datasets and comparison
with AP-Classifier [2]

Datasets AP-Classifier [2] KNN C4.5 JRIP


Accuracy p Accuracy p Accuracy p Accuracy p
Balance 86.35 ± 2.27 11 85.7 ± 2.13 1 74.15 ± 2.42 0.5 76.05 ± 2.85 9
Car 94.16 ± 4.11 11 91.5 ± 1.95 1 93.14 ± 1.95 0.5 89.13 ± 2.55 8
Monk1 99.77 ± 0.71 7 94.82 ± 2.37 3 94.53 ± 2.35 0.1 93.75 ± 2.48 2
Monk2 99.77 ± 0.7 11 82.49 ± 7.56 1 80.79 ± 0.78 0.1 80.56 ± 0.82 3
Monk3 99.63 ± 0.7 9 98.38 ± 1.41 3 98.41 ± 1.41 0.1 98.27 ± 1.42 2
Breast Cancer 73.68 ± 6.36 10 83.0 ± 3.19 6 82.47 ± 3.93 0.1 80.3 ± 7.01 3
Voting 94.73 ± 3.72 7 93.89 ± 2.31 2 96.12 ± 2.02 0.3 96.1 ± 2.04 3
Hayes-Roth 79.29 ± 9.3 7 74.62 ± 8.84 1 80.43 ± 12.53 0.1 88.52 ± 8.8 2
W. B. Cancer 97.01 ± 3.35 4 96.82 ± 1.22 3 97.37 ± 1.23 0.5 96.56 ± 2.19 5

This comparison first shows the interest of analogical proportions as a
classification tool for some datasets, and second as a way of enlarging
datasets in other cases. Identifying on which datasets each of these methods
is best applied should be investigated in depth in the future.
In terms of complexity, the proposed analogical interpolation approaches
(which are quadratic due to the use of pairs of examples), combined with the
IBk classifier for example (which is linear), lead to an improved classifier
that shows better classification accuracy while enjoying a reduced complexity
compared to the AP-classifier, whose cubic complexity may be computationally
costly for large datasets [2].
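The quadratic-versus-cubic gap can be made concrete by counting candidate pairs and candidate triples for a few of the dataset sizes of Table 1:

```python
from math import comb

# Pair-based interpolation examines O(n^2) candidate pairs, whereas
# triple-based analogical extrapolation examines O(n^3) candidate triples;
# the gap grows quickly with the number n of training examples.
for n in (286, 625, 1102):  # Breast Cancer, Balance and Nursery sizes
    print(f"n={n}: pairs={comb(n, 2):,} triples={comb(n, 3):,}")
```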

6 Conclusion
This paper has studied the idea of enlarging a training set using analogical
proportions as in [4], with two main differences: we only consider pairs of
examples, by using continuous analogical proportions, which reduces the
complexity from cubic to quadratic; and we experiment with ordered nominal
datasets instead of Boolean ones.
On the one hand, the results obtained by classical machine learning methods
on the enlarged training sets generally improve on those obtained by applying
these methods to the original training sets. On the other hand, these results,
obtained with a lower level of complexity, are often not far from those
obtained by directly applying the analogical proportion-based classification
method [2] to the original training set.

References
1. Bayoudh, S., Mouchère, H., Miclet, L., Anquetil, E.: Learning a classifier with very
few examples: analogy based and knowledge based generation of new examples for
character recognition. In: Kok, J.N., Koronacki, J., Mantaras, R.L., Matwin, S.,
Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp.
527–534. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74958-5_49
2. Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or
numerical data. Int. J. Approximate Reasoning 91, 36–55 (2017)
3. Bounhas, M., Prade, H., Richard, G.: Oddness-based classification: a new way of
exploiting neighbors. Int. J. Intell. Syst. 33(12), 2379–2401 (2018)
4. Couceiro, M., Hug, N., Prade, H., Richard, G.: Analogy-preserving functions: a way
to extend Boolean samples. In: Proceedings 26th International Joint Conference
on Artificial Intelligence, IJCAI 2017, Melbourne, 19–25 August, pp. 1575–1581
(2017)
5. Derrac, J., Schockaert, S.: Inducing semantic relations from conceptual spaces: a
data-driven approach to plausible reasoning. Artif. Intell. 228, 66–94 (2015)
6. Dubois, D., Prade, H., Richard, G.: Multiple-valued extensions of analogical pro-
portions. Fuzzy Sets Syst. 292, 193–202 (2016)
7. Goodfellow, I., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M.,
Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Informa-
tion Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014)
8. Hsu, C., Chang, C., Lin, C.: A practical guide to support vector classification.
Technical report, Department of Computer Science, National Taiwan University
(2010)
9. Inoue, H.: Data augmentation by pairing samples for images classification. CoRR
abs/1801.02929 (2018). http://arxiv.org/abs/1801.02929
10. Lieber, J., Nauer, E., Prade, H., Richard, G.: Making the best of cases by approxi-
mation, interpolation and extrapolation. In: Cox, M.T., Funk, P., Begum, S. (eds.)
ICCBR 2018. LNCS (LNAI), vol. 11156, pp. 580–596. Springer, Cham (2018).
https://doi.org/10.1007/978-3-030-01081-2_38
11. Mertz, J., Murphy, P.: UCI repository of machine learning databases (2000).
ftp://ftp.ics.uci.edu/pub/machine-learning-databases

12. Miclet, L., Bayoudh, S., Delhay, A.: Analogical dissimilarity: definition, algorithms
and two experiments in machine learning. JAIR 32, 793–824 (2008)
13. Miclet, L., Prade, H.: Handling analogical proportions in classical logic and fuzzy
logics settings. In: Sossai, C., Chemello, G. (eds.) ECSQARU 2009. LNCS (LNAI),
vol. 5590, pp. 638–650. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02906-6_55
14. Perfilieva, I., Dubois, D., Prade, H., Esteva, F., Godo, L., Hodáková, P.: Inter-
polation of fuzzy data: analytical approach and overview. Fuzzy Sets Syst. 192,
134–158 (2012)
15. Prade, H., Richard, G.: From analogical proportion to logical proportions. Logica
Universalis 7(4), 441–505 (2013)
16. Prade, H., Richard, G.: Analogical proportions: from equality to inequality. Int. J.
Approximate Reasoning 101, 234–254 (2018)
17. Prade, H., Schockaert, S.: Completing rule bases in symbolic domains by analogy
making. In: Galichet, S., Montero, J., Mauris, G. (eds.) Proceedings 7th Conference
European Society for Fuzzy Logic and Technology (EUSFLAT), Aix-les-Bains, 18–
22 July, pp. 928–934. Atlantis Press (2011)
18. Schockaert, S., Prade, H.: Interpolation and extrapolation in conceptual spaces: a
case study in the music domain. In: Rudolph, S., Gutierrez, C. (eds.) RR 2011.
LNCS, vol. 6902, pp. 217–231. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23580-1_16
19. Schockaert, S., Prade, H.: Qualitative reasoning about incomplete categorization
rules based on interpolation and extrapolation in conceptual spaces. In: Benferhat,
S., Grant, J. (eds.) SUM 2011. LNCS (LNAI), vol. 6929, pp. 303–316. Springer,
Heidelberg (2011). https://doi.org/10.1007/978-3-642-23963-2_24
20. Schockaert, S., Prade, H.: Interpolative and extrapolative reasoning in proposi-
tional theories using qualitative knowledge about conceptual spaces. Artif. Intell.
202, 86–131 (2013)
21. Schockaert, S., Prade, H.: Interpolative reasoning with default rules. In: Rossi, F.
(ed.) IJCAI 2013, Proceedings 23rd International Joint Conference on Artificial
Intelligence, Beijing, 3–9 August, pp. 1090–1096 (2013)
22. Schockaert, S., Prade, H.: Completing symbolic rule bases using betweenness and
analogical proportion. In: Prade, H., Richard, G. (eds.) Computational Approaches
to Analogical Reasoning: Current Trends. SCI, vol. 548, pp. 195–215. Springer,
Heidelberg (2014). https://doi.org/10.1007/978-3-642-54516-0_8
23. Wolf, L., Martin, I.: Regularization through feature knock out. MIT Computer
Science and Artificial Intelligence Laboratory (CBCL Memo 242) (2004)
Towards a Reconciliation Between
Reasoning and Learning - A Position
Paper

Didier Dubois and Henri Prade(B)

IRIT - CNRS, 118 route de Narbonne, 31062 Toulouse Cedex 09, France
{dubois,prade}@irit.fr

Abstract. The paper first examines the contours of artificial intelligence
(AI) at its beginnings, more than sixty years ago, and points out
the important place that machine learning already had at that time. The
ambition of AI, namely making machines capable of performing any information
processing task that the human mind can do, means that AI should
cover the two modes of human thinking: the instinctive (reactive) one
and the deliberative one. This also corresponds to the difference between
mastering a skill without being able to articulate it and holding some
pieces of knowledge that one can use to explain and teach. In case a
function-based representation applies to a considered AI problem, the
respective merits of learning a universal approximation of the function
vs. a rule-based representation are discussed, with a view to better draw
the contours of AI. Moreover, the paper reviews the relative positions of
knowledge and data in reasoning and learning, and advocates the need
for bridging the two tasks. The paper is also a plea for a unified view of
the various facets of AI as a science.

1 Introduction

What is artificial intelligence (AI) about? What are the research topics that
belong to AI? What are the topics that stand outside? In other words, what
are the contours of AI? Answers to these questions may have evolved with time,
as did the issue of the proper way (if any) of doing AI. Indeed over time, AI
has been successively dominated by logical approaches (until the mid 1990’s)
giving birth to the so-called “symbolic AI”, then by (Bayesian) probabilistic
approaches, and since recently by another type of numerical approach, artificial
neural networks. This state of facts has contributed to developing antagonistic
feelings between different schools of thought, including claims of supremacy of
some methods over others, rather than fostering attempts to understand the
potential complementarity of approaches. Moreover, when some breakthrough
takes place in some sector of AI such as expert systems in the 1980's, or fuzzy
logic in the 1990's (outside mainstream AI), or deep learning [51] nowadays,
it is presented through its technological achievements rather than its actual
scientific results. So we may even - provocatively - wonder: Is AI a science, or
just a bunch of engineering tools? In fact, AI has developed over more than sixty
years in several directions, and many different tools have been proposed for a
variety of purposes. This increasing diversity, rather than being a valuable asset,
may be harmful for an understanding of AI as a whole, all the more so as most
AI researchers are highly specialized in some area and are largely ignoring the
rest of the field.

A preliminary version of this paper was presented at the 2018 IJCAI-ECAI workshop
“Learning and Reasoning: Principles & Applications to Everyday Spatial and Temporal
Knowledge”, Stockholm, July 13–14.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 153–168, 2019.
https://doi.org/10.1007/978-3-030-35514-2_12
Besides, beyond the phantasms and fears stirred by the phrase ‘artificial
intelligence’, words such as ‘intelligence’, ‘learning’, or ‘reasoning’ have a
large spectrum of meanings and may refer to quite different facets of human
mental activities, which blurs what we claim when using the acronym AI.
Starting with ‘intelligence’, it is useful to remember the dichotomy
popularized in [44] between two modes of thinking: “System 1”, which is fast,
instinctive and emotional, and “System 2”, which is slower, more deliberative,
and more logical. See [76] for an illustration of similar ideas in the area
of radiological diagnosis, where “super-experts” provide a correct diagnosis,
even in difficult cases, without any deliberation, while “ordinary experts”
may hesitate, deliberate on the difficult cases and finally make a wrong
diagnosis. Yet a “super-expert” is able to explain to an “ordinary expert”
what went wrong and what important features should have been noticed in the
difficult cases.
Darwiche [21] has recently pointed out that what is achieved by deep learning
corresponds to tasks that do not require much deliberation, at least for a top
expert, and is far from covering all that may be expected from AI. In other
words, the system masters skills rather than being able to also elaborate
knowledge for thinking and communicating about those skills. This is the
difference between an excellent driver (without teaching capability) and a
driving instructor.
The intended purpose of this paper is to advocate in favor of a unified view
of AI both in terms of problems and in terms of methods. The paper is orga-
nized as follows. First, in Sect. 2 a reminder on the history of the early years of
AI emphasizes the idea that the diversity of AI has been there from its incep-
tion. Then Sect. 3 first discusses relations between a function-based view and a
rule-based view of problems, in relation with “modeling versus explaining” con-
cerns. The main paradigms of AI are then restated and the need for a variety of
approaches ranging from logic to probability and beyond is highlighted. Section 4
reviews the roles of knowledge and data both in reasoning and in machine learn-
ing. Then, Sect. 5 points out problems where bridging reasoning and learning
might be fruitful. Section 6 calls for a unified view of AI, a necessary condition
for letting it become a mature science.

2 A Short Reminder of the Beginnings of AI


To have a better understanding of AI, it may be useful to take a historical
view of the emergence of the main ideas underlying it [53,54,64]. We only focus
here on its beginnings. Still, it is worth mentioning that exactly three
hundred years before the expression ‘artificial intelligence’ was coined, the
English philosopher Thomas Hobbes of Malmesbury (1588–1679) described human
thinking as a symbolic manipulation of terms similar to mathematical
calculation [39].
Indeed, he wrote “Per Ratiocinationem autem intelligo computationem.” (or in
English one year later “By ratiocination I mean computation.”) The text con-
tinues with “Now to compute, is either to collect the sum of many things that
are added together, or to know what remains when one thing is taken out of
another. Ratiocination, therefore, is the same with addition and subtraction;”
One page after one reads: “We must not therefore think that computation, that
is, ratiocination, has place only in numbers, as if man were distinguished from
other living creatures (which is said to have been the opinion of Pythagoras) by
nothing but the faculty of numbering; for magnitude, body, motion, time, degrees
of quality, action, conception, proportion, speech and names (in which all the
kinds of philosophy consist) are capable of addition and subtraction.” Such a
description appears retrospectively quite consonant with what AI programs are
trying to do!
In the late 1940’s with the advent of cybernetics [96], the introduction of
artificial neural networks [56]1 , the principle of synaptic plasticity [37] and the
concept of computing machines [91] lead to the idea of thinking machines with
learning capabilities. In 1950, the idea of machine intelligence appeared in a
famous paper by Turing [92], while Shannon [89] was investigating the possibility
of a program playing chess, and the young Zadeh [97] was already suggesting
multiple-valued logic as a tool for the conception of thinking machines.
As is well known, the official birth act of AI corresponds to a research
program whose application for financial support was written in the
summer of 1955, and entitled “A proposal for the Dartmouth summer research
project on artificial intelligence” (thus putting the name of the new field
in the title!); it was signed by the two fathers of AI, John McCarthy (1927–2011),
and Marvin Minsky (1927–2016), and their two mentors Nathaniel Rochester
(1919–2001) (who designed the IBM 701 computer and was also interested in
neural network computational machines), and Claude Shannon (1916–2001) [55]
(in 1950 he was already the founder of digital circuit design theory based on
Boolean logic, the founder of information theory, but also the designer of an
electromechanical mouse (Theseus) capable of searching through the corridors
of a maze until reaching a target and of acquiring and using knowledge from
past experience). Then a series of meetings was organized at Dartmouth College
(Hanover, New Hampshire, USA) during the summer of 1956. At that time,
McCarthy was already interested in symbolic logic representations, while Minsky

had already built a neural network learning machine (he was also a friend of
Rosenblatt [79], the inventor of perceptrons).

1 One would notice the word ‘logical’ in the title of this pioneering paper.
The interests of the six other participants can be roughly divided into
reasoning and learning concerns. On the reasoning side were Simon (1916–2001)
and Newell (1927–1992) [63] (authors, together with John Clifford Shaw
(1922–1991), of The Logic Theorist, a program able to prove theorems in
mathematical logic), and More [60] (a logician interested in natural deduction
at that time); on the learning side were Samuel (1901–1990) [81] (author of
programs for checkers, and later chess games), Selfridge (1926–2008) [84]
(one of the fathers of pattern recognition), and Solomonoff (1926–2009) [90]
(already the author of a theory of probabilistic induction).
Interestingly enough, these ten participants, with different backgrounds
ranging from psychology to electrical engineering, physics and mathematics,
were already the carriers of a large variety of research directions that are
still present in modern AI, from machine learning to knowledge representation
and reasoning.

3 Representing Functions and Beyond


There are two modes of knowledge representation, which can be called
functional and logical, respectively. The first mode consists in building a
large, often numerical, function that produces a result when triggered by some
input. The second mode consists of separate, possibly related, chunks of
explicit knowledge expressed in some language. The current dominant machine
learning paradigm has (up to notable exceptions) adopted the functional
approach2, which ensures impressive successes in tasks requiring reactiveness,
at the cost of losing explanatory power. Indeed, we can argue that what is
learnt is know-how or skills, rather than knowledge. The other, logical, mode
of representation is much more adapted to the encoding of articulated
knowledge, to reasoning from it, and to the production of explanations via
deliberation, but its connection to learning from data is for the most part
still in its infancy.
A simple starting point for discussing relationships between learning and
reasoning is to compare the machineries of a classifier and a rule-based expert
system, for diagnosis for instance. In both cases, a function-based view may
apply. On the one hand, from a set of examples (of inputs and outputs of the
function, such as pairs (symptoms, disease)) one can easily predict the disease
corresponding to a new case via its input symptoms, after learning some function
(e.g., using neural nets). On the other hand, one may have a set of expert rules
stating that if the values of the inputs are such and such, the global evaluation
should be in some subset. Such rules are mimicking the function. If collected
from an expert, rules may turn out to be much less successful than the function
learned from data. Clearly, the first view may provide better approximations
and does not require the elicitation of expert rules, which is costly. However,
the explanatory power will be poor in any case, because it will not be possible
to answer “why not” questions and to articulate explanations based on causal
relations. On the contrary, if causal knowledge is explicitly represented in the
knowledge base, it has at least the merit of offering a basis for explanations (in a
way that should be cognitively more appropriate for the end-user). It is moreover
well-known that causal information cannot easily be extracted from data: only
correlations can be laid bare if no extra information is added [66].

2 Still this function-based approach is often cast in a probabilistic modeling paradigm.
The fuzzy set literature offers early examples of the replacement of an auto-
matic control law by a set of rules. Indeed Zadeh [98] proposed to use fuzzy
expert rules for controlling complex nonlinear dynamic systems that might be
difficult to model using a classical automatic control approach, while skilled
humans can do the job. This was rapidly shown to be successful [52]. The fact of
using fuzzy rules, rather than standard Boolean if-then rules, had the advantage
of providing a basis for an interpolation mechanism, when an input was firing
several rules to some degree. Although the approach was numerical and quite
far from the symbolic logic-based AI mainstream trend in those times, it was
perceived as an AI-inspired approach, since it was relying on the representation
of expert know-how by chunks of knowledge, rather than on the derivation of a
control law from the modeling of the physical system to be controlled (i.e., the
classical control engineering paradigm). It was soon recognized
that fuzzy rules could be learnt rather than obtained from experts, while keeping
excellent results thanks to the property of universal approximation possessed by
sets of fuzzy rules. Mathematical models of such fuzzy rules are in fact closely
related to neural network radial basis functions. But, fuzzy rules thus obtained
by learning may become hardly intelligible. This research trend, known under
the names of ‘soft computing’ or ‘computational intelligence’, thus often drifted
away from an important AI concern, the explainability power; see [27] for a
discussion.
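The interpolation mechanism that distinguishes fuzzy rules from Boolean if-then rules can be sketched in a few lines. This is a minimal Takagi-Sugeno-style illustration, not any particular controller: the membership functions and rule conclusions below are invented, and the output interpolates between the conclusions of the rules fired by the input.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzy_infer(x, rules):
    """Weighted average of rule conclusions by firing degree."""
    degrees = [(triangular(x, *antecedent), conclusion)
               for antecedent, conclusion in rules]
    total = sum(d for d, _ in degrees)
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    return sum(d * c for d, c in degrees) / total

# Two overlapping rules: "if x is around 0, y = 0" and "if x is around 10, y = 100".
rules = [((-10.0, 0.0, 10.0), 0.0), ((0.0, 10.0, 20.0), 100.0)]
print(fuzzy_infer(5.0, rules))   # fires both rules to degree 0.5 -> 50.0
```

An input firing a single rule reproduces that rule's conclusion; an input firing two rules yields a value in between, which is the interpolation effect discussed above.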
The long term ambition of AI is to make machines capable of performing any
information processing task the human mind can perform. This certainly includes
recognition, identification, decision and diagnosis tasks (including sophisticated
ones). They are “System 1” tasks (using Kahneman’s terminology) as long as
we do not need to explain and reason about obtained results. But there are
other problems that are not fully of this kind, even if machine learning may
also play a role in their solving. Consider for instance the solving of quadratic
equations. Even if we could predict, in a bounded domain, by machine learning
techniques, whether an equation has zero, one or two solutions and what are their
values (with a good approximation) from a large amount of examples, the solving
of such equations by discovering their analytical solution(s), via factorization
through symbolic calculations, seems to be a more powerful way of handling
the problem (the machine could then teach students).
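The symbolic route for the quadratic example is a few lines of exact computation via the discriminant (standard mathematics, shown here only to contrast with a learned approximation):

```python
import math

def solve_quadratic(a, b, c):
    """Real solutions of a*x^2 + b*x + c = 0 via the discriminant."""
    if a == 0:
        raise ValueError("not quadratic")
    disc = b * b - 4 * a * c
    if disc < 0:
        return []                      # no real solution
    if disc == 0:
        return [-b / (2 * a)]          # one (double) root
    r = math.sqrt(disc)
    return [(-b - r) / (2 * a), (-b + r) / (2 * a)]

print(solve_quadratic(1, -3, 2))   # roots of x^2 - 3x + 2: [1.0, 2.0]
```

Unlike a learned predictor, this procedure is exact on any input, reports the number of solutions by construction, and each step can be explained to a student.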
AI problems cannot always be viewed in terms of the function-based view
mentioned above. There are cases where we do not have a function, only a one-to-
many mapping, e.g., when finding all the solutions (if any) of a set of constraints.
Apart from solving combinatorial problems, tasks such as reasoning about static
or dynamical situations, or building action plans, or explaining results,

158 D. Dubois and H. Prade

communicating explanations pertaining to machine decisions in a meaningful way to
an end-user, or analyzing arguments and determining their possible weakness,
or understanding what is going on in a text, a dialog in natural language, in an
image, a video, or finding relevant information and summarizing it are examples
that may require capabilities beyond pure machine learning. This is why AI, over
the years, has developed general representation settings and methods capable of
handling large classes of situations, while mastering computational complexity.
Thus, at least five general paradigms have emerged in AI:

– Knowledge representation with symbolic or numerical structured settings
for representing knowledge or preferences, such as logical languages, graph-
ical representations like Bayesian networks, or domain ontologies describing
taxonomy of concepts. Dedicated settings have been also developed for the
representation of temporal or spatial information, of uncertain pieces of infor-
mation, or of independence relations.
– Reasoning and decision Different types of reasoning tasks, beyond classical
deduction, have been formalized, such as: nonmonotonic reasoning for dealing
with exception-tolerant rules in the presence of incomplete information,
or reasoning from inconsistent information, or belief revision, belief updating,
information fusion in the presence of conflicts, or formal argumentation han-
dling pros and cons, or yet reasoning directly from data (case-based reasoning,
analogical reasoning, interpolation, extrapolation). Models for qualitative (or
quantitative) decision from compact representations have been proposed for
decision under uncertainty, multiple criteria, or group decisions.
– General algorithms for problem solving This covers a panoply of generic
tools ranging from heuristic ordered search methods, general problem solver
techniques, methods for handling constraints satisfaction problems, to effi-
cient algorithms for classical logic inference (e.g., SAT methods), or for deduc-
tion in modal and other non-classical logics.
– Learning The word ‘learning’ also covers different problems, from the clas-
sification of new items based on a set of examples (and counter-examples),
the induction of general laws describing concepts, the synthesis of a function
by regression, the clustering of similar data (separating dissimilar data into
different clusters) and the labelling of clusters, to reinforcement learning and
to the discovery of regularities in databases and data mining. Moreover, each
of these problems can often be solved by a variety of methods.
– Multiple agent AI Under this umbrella, there are quite different problems
such as: the cooperation between human or artificial agents and the organi-
zation of tasks for achieving collective goals, the modeling of BDI agents
(Belief, Desire, Intention), possibly in situations of dialogue (where, e.g.,
agents, which have different information items at their disposal, do not pur-
sue the same goals, and try to guess the intentions of the other ones), or the
study of the emergence of collective behaviors from the behaviors of elemen-
tary agents.

4 Reasoning with Knowledge or with Data

In the above research areas, knowledge and data are often handled separately.
In fact, AI traditionally deals with knowledge rather than with data, with the
important exception of machine learning, whose aim can sometimes be viewed
as changing data into knowledge. Indeed, basic knowledge is obtained from data
by induction, while prior background knowledge may help learning machineries.
These remarks suggest that the joint handling of knowledge and data is a general
issue, and that combining reasoning and learning methods should be seriously
considered.
Rule-based systems, or ontologies expressed by means of description logics,
or yet Bayesian networks, represent background knowledge that is useful to make
prediction from facts and data. In these reasoning tasks, knowledge as well as
data is often pervaded with uncertainty. This has been extensively investigated.
Data, provided that they are reliable, are positive in nature since their existence
manifests the actual possibility of what is observed or reported. This
contrasts with knowledge, which delimits the extent of what is potentially possible by
specifying what is impossible (which has thus a negative flavor). This is why
reasoning from both knowledge and data goes much beyond the application of
generic knowledge to factual data as in expert systems, and even the separate
treatment of knowledge and data in description logics via ‘TBox’ and ‘ABox’
[4]. It is a complex issue, which has received little attention until now [93].
As pointed out in [71], reasoning directly with data has been much less stud-
ied. The idea of similarity naturally applies to data and gives birth to specific
forms of reasoning such as case-based reasoning [45], case-based decision [35], or
even case-based argumentation. “Betweenness” and similarity are at the basis
of interpolation mechanisms, while analogical reasoning, which may be both a
matter of similarity and dissimilarity, provides a mechanism for extrapolation.
A well-known way of handling similarity and interpolation is to use fuzzy rules
(where fuzzy set membership degrees capture the idea of similarity w.r.t. the
core value(s) of the fuzzy set) [67]. Besides, analogical reasoning, based on ana-
logical proportions (i.e., statements of the form “a is to b as c is to d”, where
items a, b, c, d are represented in terms of Boolean, nominal or numerical vari-
ables), which can be logically represented [28,58,72], provides an extrapolation
mechanism that from three items a, b, c described by complete vectors, amounts
to inferring the missing value(s) in incomplete vector d, provided that a, b, c, d
make an analogical proportion component-wise on the known part of d; this was
successfully applied to classification [14,18,57], and more recently to preference
learning [13,32].
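The extrapolation mechanism just described can be sketched on Boolean vectors. The toy data below are invented for illustration, with the last component of each vector playing the role of a class label to be inferred:

```python
# Analogical-proportion extrapolation on Boolean vectors: given three
# complete examples a, b, c and a fourth item d whose last attribute
# (here, a class label) is unknown, the missing value is the solution x
# of a : b :: c : x, provided the proportion holds on the known attributes.

def proportion_holds(a, b, c, d):
    """Boolean analogical proportion 'a is to b as c is to d'."""
    return (a and not b) == (c and not d) and (not a and b) == (not c and d)

def solve(a, b, c):
    """Solution x of a : b :: c : x, or None when no solution exists."""
    if a == b:
        return c          # a : a :: c : c
    if a == c:
        return b          # a : b :: a : b
    return None           # e.g. 0 : 1 :: 1 : x has no Boolean solution

def extrapolate(a, b, c, d_known):
    """Infer the missing last component of d from complete a, b, c."""
    n = len(d_known)
    if all(proportion_holds(a[i], b[i], c[i], d_known[i]) for i in range(n)):
        return solve(a[n], b[n], c[n])
    return None

a = (0, 0, 1, 0)
b = (0, 1, 1, 1)
c = (1, 0, 1, 0)
d = (1, 1, 1)          # known attributes only; label to be inferred
print(extrapolate(a, b, c, d))   # -> 1
```

In an actual analogy-based classifier, many triples (a, b, c) would vote on the label of d; the sketch shows the inference step for a single triple.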
Lastly, the ideas of interpolation and extrapolation closely related to analogi-
cal proportion-based inference seem to be of crucial importance in many numeri-
cal domains. They can be applied to symbolic settings in the case of propositional
categorization rules, using relations of betweenness and parallelism respectively,
under a conceptual spaces semantics [83]; see [82] for an illustration.

5 Issues in Learning: Incomplete Data and Representation Formats

The need for reasoning from incomplete, uncertain, vague, or inconsistent infor-
mation, has led to the development of new approaches beyond logic and prob-
ability. Incompleteness is a well-known phenomenon in classical logic. However,
many reasoning problems exceed the capabilities of classical logic (initially devel-
oped in relation with the foundations of mathematics where statements are true
or false, and there is no uncertainty in principle). As for probability theory,
single probability distributions, often modeled by Bayesian networks, are not
fully appropriate for handling incomplete information or epistemic uncertainty.
There are different, but related, frameworks for modeling ill-known probabilities
that were developed in the last 50 years by the Artificial Intelligence community
at large [95]: belief functions and evidence theory (which may be viewed as a
randomization of the set-based approach to incomplete information), imprecise
probability theory [3,94] (which uses convex families of probability functions)
and quantitative possibility theory (which is the simplest model since one of the
lower and the upper probability bounds is trivial).
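As a minimal illustration of the first of these frameworks, the belief/plausibility pair induced by a mass function brackets the ill-known probability of an event. The mass values below are invented:

```python
# Evidence theory in miniature: a mass function distributes evidence
# over *sets* of possible values (focal sets), and the induced
# belief/plausibility pair gives lower and upper bounds on P(A).

def belief(mass, event):
    """Bel(A): total mass of focal sets included in A (certainly supports A)."""
    return sum(m for focal, m in mass.items() if focal <= event)

def plausibility(mass, event):
    """Pl(A): total mass of focal sets consistent with A (does not contradict A)."""
    return sum(m for focal, m in mass.items() if focal & event)

# Frame of discernment {a, b, c}; mass 0.5 on {a}, 0.3 on {a, b},
# and 0.2 left on the whole frame (total ignorance about that part).
mass = {frozenset({"a"}): 0.5,
        frozenset({"a", "b"}): 0.3,
        frozenset({"a", "b", "c"}): 0.2}

event = frozenset({"a", "b"})
print(belief(mass, event), plausibility(mass, event))   # 0.8 1.0
```

The gap between Bel(A) and Pl(A) reflects the incompleteness of the evidence; it collapses to a single probability when every focal set is a singleton.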
The traditional approach for going from data to knowledge is to resort to statistical
inferential methods. However, these methods traditionally assume data that
are precise and available in sufficient quantity. The recent concern with big data seems
to even strengthen the relevance of probability theory and statistics. However
there are a number of circumstances where data is missing or is of poor quality,
especially if one tries to collect information for building machines or algorithms
supposed to face very complex or unexpected situations (e.g., autonomous vehi-
cles in crowded areas). The concern of Artificial Intelligence for reasoning about
partial knowledge has led to a questioning of traditional statistical methods when
data is of poor quality [19,38,42,43].
Besides, the fact that we may have to work with incomplete relational data
and that knowledge may also be uncertain has motivated the development of
a new probabilistic programming language first called “Probabilistic Similarity
Logic”, and then “Probabilistic Soft Logic” (PSL, for short) where each ground
atom in a rule has a truth value in [0, 1]. It uses the Łukasiewicz t-norm and
co-t-norm to handle the fuzzy logical connectives [5,33,34]. We are close to rep-
resentation concerns of fuzzy answer set programs [61]. Besides, there is a need
for combining symbolic reasoning with the subsymbolic vector representation
of neural networks in order to use gradient descent for training the neural net-
work to infer facts from an incomplete knowledge base, using similarity between
vectors [16,17,78].
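The Łukasiewicz connectives mentioned above can be stated in a few lines; the "distance to satisfaction" of a ground rule follows the PSL literature [5], and the truth values below are illustrative:

```python
# Lukasiewicz connectives on truth values in [0, 1]: conjunction
# max(0, x + y - 1), disjunction min(1, x + y), and the residual
# implication min(1, 1 - x + y).

def luk_and(x, y):
    return max(0.0, x + y - 1.0)

def luk_or(x, y):
    return min(1.0, x + y)

def luk_implies(x, y):
    return min(1.0, 1.0 - x + y)

def distance_to_satisfaction(body, head):
    """As in PSL: a rule body -> head is fully satisfied when the head
    is at least as true as the body; otherwise the gap is penalized."""
    return max(0.0, body - head)

body = luk_and(0.9, 0.8)   # max(0, 0.9 + 0.8 - 1), i.e. 0.7 up to rounding
print(body, luk_implies(body, 0.6), distance_to_satisfaction(body, 0.6))
```

Because these connectives are piecewise linear, the resulting inference problems stay convex, which is what makes PSL scalable.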
Machine learning may find some advantages to use advanced representation
formats as target languages, such as weighted logics [26] (Markov logic, proba-
bilistic logic programs, multi-valued logics, possibilistic logic, etc.). For instance,
qualitative possibility theory extends classical logic by attaching lower bounds of
necessity degrees and captures nonmonotonic reasoning, while generalized possi-
bilistic logic [30] is more powerful and can capture answer-set programming, or
reason about the ignorance of an agent. Can such kinds of qualitative uncertainty
modeling, or yet fuzzy or uncertain description logics, uncertainty representation
formalisms, weighted logics, be used more extensively in machine learning?
Various answers and proposals can be found in [48–50,86,88]. This also raises
the question of extending version space learning [59] to such new representation
schemes [41,73,75].
If-then rules, in classical logic formats, are a popular representation format in
relational learning [80]. Association rules have logical and statistical bases; they
are rules with exceptions completed by confidence and support degrees [1,36].
But, other types of rules may be of interest. Mining genuine default rules that
obey Kraus, Lehmann and Magidor postulates [47] for nonmonotonic reasoning
relies on the discovery of big-stepped probabilities [8] in a database [9]. Multiple
threshold rules, i.e., rules describing how a global evaluation depends on multiple
criteria evaluations on linearly ordered scales, such as, e.g., selection rules of the
form “if x1 ≥ a1 and · · · and xn ≥ an then y ≥ b” play a central role in ordinal
classification [46] and can be represented by Sugeno integrals or their extensions
[15,74]. Gradual rules, i.e., statements of the form “the more x is A, the more y is
B”, where A, and B are fuzzy sets, are another representation format of interest
[65,87]. Other types of fuzzy rules may provide a rule-based interpretation [20]
for neural nets, which may be also related to non-monotonic inference [7,22]. All
these examples indicate the variety of rules that make sense and can be considered
both in reasoning and in learning.
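As an illustration of how multiple threshold rules can be represented, a Sugeno integral aggregates criteria evaluations on an ordinal scale with respect to a capacity (a monotone set function weighting each group of criteria). The capacity and scores below are invented:

```python
# Sugeno integral: max over thresholds t of min(t, capacity of the
# group of criteria reaching t). Selection rules of the form
# "if x1 >= a1 and ... and xn >= an then y >= b" correspond to such
# integrals, as discussed in the text.

def sugeno_integral(scores, capacity):
    """scores: {criterion: value in [0, 1]}; capacity: {frozenset: weight}."""
    result = 0.0
    for t in sorted(set(scores.values()), reverse=True):
        group = frozenset(i for i, s in scores.items() if s >= t)
        result = max(result, min(t, capacity[group]))
    return result

# Three criteria; the capacity says criteria {1, 2} together are fully
# important, while any single criterion only counts for 0.5.
capacity = {
    frozenset(): 0.0,
    frozenset({1}): 0.5, frozenset({2}): 0.5, frozenset({3}): 0.5,
    frozenset({1, 2}): 1.0, frozenset({1, 3}): 0.5, frozenset({2, 3}): 0.5,
    frozenset({1, 2, 3}): 1.0,
}
scores = {1: 0.8, 2: 0.6, 3: 0.3}
print(sugeno_integral(scores, capacity))   # -> 0.6
```

Here the global evaluation 0.6 is reached because the fully important group {1, 2} both score at least 0.6; only min and max are used, which is why the operator fits purely ordinal scales.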
Another trend of research has been also motivated by the extraction of sym-
bolic knowledge from neural networks [22] under the form of nonmonotonic rules.
The goal of a neuro-symbolic integration has been pursued with the proposal of
a connectionist modal logic, where extended modal logic programs are trans-
lated into neural network ensembles, thus providing a neural net view of, e.g.,
the muddy children problem [24]. Following a similar line of thought, the same
authors translate a logic program encoding an argumentation network, which is
then turned into a neural network for arguments [23]. A more recent series of
works [25,85,86] propose another form of integration between logic and neural
nets using a so-called “Real Logic”, implemented in deep Tensor Neural Net-
works, for integrating deductive reasoning and machine learning. The semantics
of the logical constants is in terms of vectors of real numbers, and first order
logic formulas have degrees of truth in [0, 1] handled with Łukasiewicz multiple-valued
logic connectives. Somewhat related is a work on ontology reasoning [40]
where the goal is to generate a neural network with binary outputs that, given a
database storing tuples of the form (subject, predicate, object), is able, for any
input literal, to decide the entailment problem for a logic program describing the
ontology. Others look for an exact representation of a binarized neural network
as a Boolean formula [62].
The use of degrees of truth in multiple-valued logic raises the question of
the exact meaning of these degrees. In relation with this kind of work, some
have advocated a non-probabilistic view of uncertainty [11], but strangely
enough without any reference to the other uncertainty representation frame-
works! Maybe more promising is the line of research initiated a long time ago by
Pinkas [68,69] where the idea of penalty logic (related to belief functions [31]) has
been developed in relation with neural networks, where penalty weights reflect
priorities attached to logical constraints to be satisfied by a neural network
[70]. Penalty logics and Markov logic [77] are also closely related to possibilistic
logic [30].
Another intriguing question would be to explore possible relations between
spiking neurons [12], which are physiologically more plausible than classical artificial
neural networks and fire when conjunctions of thresholds are reached, and
Sugeno integrals (then viewed as a System 1-like black box) and their logical
counterparts [29] (corresponding to a System 2-like representation).

6 Conclusion

Knowledge representation and reasoning on the one hand, and machine learning
on the other hand, have been developed largely as independent research trends
in artificial intelligence in the last three decades. Yet, reasoning and learning
are two basic capabilities of the human mind that do interact. Similarly the two
corresponding AI research areas may benefit from mutual exchanges. Current
learning methods derive know-how from data in the form of complex functions
involving many tuning parameters, but they should also aim at producing artic-
ulated knowledge, so that repositories, storing interpretable chunks of informa-
tion, could be fed from data. More precisely, a number of logical-like formalisms,
whose explanatory capabilities could be exploited, have been developed in the
last 30 years (non-monotonic logics, modal logics, logic programming, probabilis-
tic and possibilistic logics, many-valued logics, etc.) that could be used as target
languages for learning techniques, without restricting to first-order logic, nor to
Bayes nets.
Interfacing classifiers with human users may require some ability to provide
high level explanations about recommendations or decisions that are understand-
able by an end-user. Reasoning methods should handle knowledge and informa-
tion extracted from data. The joint use of (supervised or unsupervised) machine
learning techniques and of inference machineries raises new issues. There are a
number of other points, worth mentioning, which have not been addressed in the
above discussions:

– Teachability A related issue is more generally how to move from machine
learning models to knowledge communicated to humans, about the way the
machine proceeds when solving problems.
– Using prior knowledge Another issue is a more systematic exploitation of
symbolic background knowledge in machine learning devices. Can prior causal
knowledge help exploiting data and getting rid of spurious correlations? Can
an argumentation-based view of learning be developed?
– Representation learning Data representation impacts the performance of
machine learning algorithms [10]. In that respect, what may be, for instance,
the role of vector space embeddings, or conceptual spaces?
– Unification of learning paradigms Would it be possible to bridge learning
paradigms from transduction to inductive logic programming? Even including
formal concept analysis, or rough set theory?

This paper has especially advocated the interest of a cooperation between two
basic areas of AI: knowledge representation and reasoning on the one hand and
machine learning on the other hand, reflecting the natural cooperation between
two modes, respectively reactive and deliberative, of human intelligence. It is also
a plea for maintaining a unified view of AI, all facets of which have been present
from the very beginning, as recalled in Sect. 2 of this paper. It is time that AI
comes of age as a genuine science, which means ending unproductive rivalries
between different approaches, and fostering a better shared understanding of the
basics of AI through open-minded studies bridging sub-areas in a constructive
way. In the same spirit, a plea for a unified view of computer science can be found
in [6]. Mixing, bridging, hybridizing advanced ideas in knowledge representation,
reasoning, and machine learning or data mining should renew basic research in
AI and contribute in the long term to a more unified view of AI methodology.
The interested reader may follow the work in progress of the group “Amel”
[2] aiming at a better mutual understanding of research trends in knowledge
representation, reasoning and machine learning, and how they could cooperate.

Acknowledgements. The authors thank Emiliano Lorini, Dominique Longin, Gilles
Richard, Steven Schockaert, Mathieu Serrurier for useful exchanges on some of the
issues surveyed in this paper. This work was partially supported by ANR-11-LABX-
0040-CIMI (Centre International de Mathématiques et d’Informatique) within the pro-
gram ANR-11-IDEX-0002-02, project ISIPA.

References
1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets
of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proceedings 1993
ACM SIGMOD International Conference on Management of Data, Washington,
DC, 26–28 May 1993, pp. 207–216. ACM Press (1993)
2. Amel, K.R.: From shallow to deep interactions between knowledge representation,
reasoning and machine learning. In: Ben Amor, N., Theobald, M. (eds.) Proceedings
13th International Conference on Scalable Uncertainty Management (SUM 2019), Compiègne,
LNCS, 16–18 December 2019. Springer, Heidelberg (2019)
3. Augustin, T., Coolen, F.P.A., De Cooman, G., Troffaes, M.C.M.: Introduction to
Imprecise Probabilities. Wiley, Hoboken (2014)
4. Baader, F., Horrocks, I., Lutz, C., Sattler, U.: An Introduction to Description
Logic. Cambridge University Press, Cambridge (2017)
5. Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random
fields and probabilistic soft logic. J. Mach. Learn. Res. 18, 109:1–109:67 (2017)
6. Bajcsy, R., Reynolds, C.W.: Computer science: the science of and about informa-
tion and computation. Commun. ACM 45(3), 94–98 (2002)
7. Balkenius, C., Gärdenfors, P.: Nonmonotonic inferences in neural networks. In:
Proceedings 2nd International Conference on Principle of Knowledge Representa-
tion and Reasoning (KR 1991), Cambridge, MA, pp. 32–39 (1991)
8. Benferhat, S., Dubois, D., Prade, H.: Possibilistic and standard probabilistic
semantics of conditional knowledge bases. J. Log. Comput. 9(6), 873–895 (1999)
9. Benferhat, S., Dubois, D., Lagrue, S., Prade, H.: A big-stepped probability app-
roach for discovering default rules. Int. J. Uncert. Fuzz. Knowl.-based Syst.
11(Suppl.–1), 1–14 (2003)
10. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new
perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
11. Besold, T.R., Garcez, A.D.A., Stenning, K., van der Torre, L., van Lambalgen,
M.: Reasoning in non-probabilistic uncertainty: logic programming and neural-
symbolic computing as examples. Minds Mach. 27(1), 37–77 (2017)
12. Bichler, O., Querlioz, D., Thorpe, S.J., Bourgoin, J.-P., Gamrat, C.: Extraction
of temporally correlated features from dynamic vision sensors with spike-timing-
dependent plasticity. Neural Netw. 32, 339–348 (2012)
13. Bounhas, M., Pirlot, M., Prade, H., Sobrie, O.: Comparison of analogy-based
methods for predicting preferences. In: Ben Amor, N., Theobald, M. (eds.) Proceedings
13th International Conference on Scalable Uncertainty Management (SUM 2019),
Compiègne, LNCS, 16–18 December. Springer, Heidelberg (2019)
14. Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or
numerical data. Int. J. Approx. Reasoning 91, 36–55 (2017)
15. Brabant, Q., Couceiro, M., Dubois, D., Prade, H., Rico, A.: Extracting decision
rules from qualitative data via sugeno utility functionals. In: Medina, J., Ojeda-
Aciego, M., Verdegay, J.L., Pelta, D.A., Cabrera, I.P., Bouchon-Meunier, B., Yager,
R.R. (eds.) IPMU 2018. CCIS, vol. 853, pp. 253–265. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-91473-2 22
16. Cohen, W.W.: TensorLog: a differentiable deductive database. CoRR,
abs/1605.06523 (2016)
17. Cohen, W.W., Yang, F., Mazaitis, K.: TensorLog: deep learning meets probabilistic
DBs. CoRR, abs/1707.05390 (2017)
18. Couceiro, M., Hug, N., Prade, H., Richard, G.: Analogy-preserving functions: a way
to extend Boolean samples. In: Proceedings 26th International Joint Conference
on Artificial Intelligence, (IJCAI 2017), Melbourne, 19–25 August, pp. 1575–1581
(2017)
19. Couso, I., Dubois, D.: A general framework for maximizing likelihood under incom-
plete data. Int. J. Approx. Reasoning 93, 238–260 (2018)
20. d’Alché-Buc, F., Andrés, V., Nadal, J.-P.: Rule extraction with fuzzy neural net-
work. Int. J. Neural Syst. 5(1), 1–11 (1994)
21. Darwiche, A.: Human-level intelligence or animal-like abilities?. CoRR,
abs/1707.04327 (2017)
22. d’Avila Garcez, A.S., Broda, K., Gabbay, D.M.: Symbolic knowledge extraction
from trained neural networks: a sound approach. Artif. Intell. 125(1–2), 155–207
(2001)
23. d’Avila Garcez, A.S., Gabbay, D.M., Lamb, L.C.: Value-based argumentation
frameworks as neural-symbolic learning systems. J. Logic Comput. 15(6), 1041–
1058 (2005)
24. d’Avila Garcez, A.S., Lamb, L.C., Gabbay, D.M.: Connectionist modal logic: repre-
senting modalities in neural networks. Theor. Comput. Sci. 371(1–2), 34–53 (2007)
25. Donadello, I., Serafini, L., Garcez, A.D.A.: Logic tensor networks for semantic
image interpretation. In: Sierra, C. (ed) Proceedings 26th International Joint Con-
ference on Artificial Intelligence (IJCAI 2017), Melbourne, 19–25 August 2017, pp.
1596–1602 (2017)
26. Dubois, D., Godo, L., Prade, H.: Weighted logics for artificial intelligence - an
introductory discussion. Int. J. Approx. Reasoning 55(9), 1819–1829 (2014)
27. Dubois, D., Prade, H.: Soft computing, fuzzy logic, and artificial intelligence. Soft
Comput. 2(1), 7–11 (1998)
28. Dubois, D., Prade, H., Richard, G.: Multiple-valued extensions of analogical pro-
portions. Fuzzy Sets Syst. 292, 193–202 (2016)
29. Dubois, D., Prade, H., Rico, A.: The logical encoding of Sugeno integrals. Fuzzy
Sets Syst. 241, 61–75 (2014)
30. Dubois, D., Prade, H., Schockaert, S.: Generalized possibilistic logic: foundations
and applications to qualitative reasoning about uncertainty. Artif. Intell. 252, 139–
174 (2017)
31. Dupin de Saint-Cyr, F., Lang, J., Schiex, T.: Penalty logic and its link with
Dempster-Shafer theory. In: de Mántaras, R.L., Poole, D. (eds.) Proceedings 10th
Annual Conference on Uncertainty in Artificial Intelligence (UAI 1994), Seattle,
29–31 July, pp. 204–211 (1994)
32. Fahandar, M.A., Hüllermeier, E.: Learning to rank based on analogical reasoning.
In: Proceedings 32th National Conference on Artificial Intelligence (AAAI 2018),
New Orleans, 2–7 February 2018 (2018)
33. Fakhraei, S., Raschid, L., Getoor, L.: Drug-target interaction prediction for drug
repurposing with probabilistic similarity logic. In: SIGKDD 12th International
Workshop on Data Mining in Bioinformatics (BIOKDD). ACM (2013)
34. Farnadi, G., Bach, S.H., Moens, M.F., Getoor, L., De Cock, M.: Extending PSL
with fuzzy quantifiers. In: Papers from the 2014 AAAI Workshop Statistical Rela-
tional Artificial Intelligence, Québec City, 27 July, pp. WS-14-13, 35–37 (2014)
35. Gilboa, I., Schmeidler, D.: Case-based decision theory. Q. J. Econ. 110, 605–639
(1995)
36. Hájek, P., Havránek, T.: Mechanising Hypothesis Formation - Mathematical Foun-
dations for a General Theory. Springer, Heidelberg (1978). https://doi.org/10.
1007/978-3-642-66943-9
37. Hebb, D.O.: The Organization of Behaviour. Wiley, Hoboken (1949)
38. Heitjan, D., Rubin, D.: Ignorability and coarse data. Ann. Statist. 19, 2244–2253
(1991)
39. Hobbes, T.: Elements of philosophy, the first section, concerning body. In:
Molesworth, W. (ed.) The English works of Thomas Hobbes of Malmesbury, vol.
1. John Bohn, London, 1839. English translation of “Elementa Philosophiae I. De
Corpore” (1655)
40. Hohenecker, P., Lukasiewicz, T.: Ontology reasoning with deep neural networks.
CoRR, abs/1808.07980 (2018)
41. Hüllermeier, E.: Inducing fuzzy concepts through extended version space learning.
In: Bilgiç, T., De Baets, B., Kaynak, O. (eds.) IFSA 2003. LNCS, vol. 2715, pp.
677–684. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-44967-1 81
42. Hüllermeier, E.: Learning from imprecise and fuzzy observations: data disambigua-
tion through generalized loss minimization. Int. J. Approx. Reasoning 55(7), 1519–
1534 (2014)
43. Jaeger, M.: Ignorability in statistical and probabilistic inference. JAIR 24, 889–917
(2005)
44. Kahneman, D.: Thinking, Fast and Slow. Farrar, Straus and Giroux, New York
(2011)
45. Kolodner, J.L.: Case-Based Reasoning. Morgan Kaufmann, Burlington (1993)
46. Kotlowski, W., Slowinski, R.: On nonparametric ordinal classification with mono-
tonicity constraints. IEEE Trans. Knowl. Data Eng. 25(11), 2576–2589 (2013)
166 D. Dubois and H. Prade

47. Kraus, S., Lehmann, D., Magidor, M.: Nonmonotonic reasoning, preferential mod-
els and cumulative logics. Artif. Intell. 44, 167–207 (1990)
48. Kuzelka, O., Davis, J., Schockaert, S.: Encoding Markov logic networks in pos-
sibilistic logic. In: Meila, M., Heskes, T. (eds.) Proceedings 31st Conference on
Uncertainty in Artificial Intelligence (UAI 2015), Amsterdam, 12–16 July 2015,
pp. 454–463. AUAI Press (2015)
49. Kuzelka, O., Davis, J., Schockaert, S.: Learning possibilistic logic theories from
default rules. In: Kambhampati, S. (ed.) Proceedings 25th International Joint Con-
ference on Artificial Intelligence (IJCAI 2016), New York, 9–15 July 2016, pp.
1167–1173 (2016)
50. Kuzelka, O., Davis, J., Schockaert, S.: Induction of interpretable possibilistic logic
theories from relational data. In: Sierra, C. (ed.) Proceedings 26th International
Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, 19–25 August
2017, pp. 1153–1159 (2017)
51. LeCun, Y., Bengio, Y., Hinton, G.E.: Deep learning. Nature 521(7553), 436–444
(2015)
52. Mamdani, E.H., Assilian, S.: An experiment in linguistic synthesis with a fuzzy
logic controller. Int. J. Man-Mach. Stu. 7, 1–13 (1975)
53. Marquis, P., Papini, O., Prade, H.: Eléments pour une histoire de l’intelligence
artificielle. In: Panorama de l’Intelligence Artificielle. Ses Bases Méthodologiques,
ses Développements, vol. I, pp. 1–39. Cépaduès (2014)
54. Marquis, P., Papini, O., Prade, H.: Some elements for a prehistory of Artificial
Intelligence in the last four centuries. In: Proceedings 21st European Conference
on Artificial Intelligence (ECAI 2014), Prague, pp. 609–614. IOS Press (2014)
55. McCarthy, J., Minsky, M., Rochester, N., Shannon, C.E.: A proposal for the Dartmouth
summer research project on artificial intelligence, August 31, 1955. AI Mag.
27(4), 12–14 (2006)
56. McCulloch, W.S., Pitts, W.: A logical calculus of ideas immanent in nervous activ-
ity. Bull. Math. Biophys. 5, 115–133 (1943)
57. Miclet, L., Bayoudh, S., Delhay, A.: Analogical dissimilarity: definition, algorithms
and two experiments in machine learning. JAIR 32, 793–824 (2008)
58. Miclet, L., Prade, H.: Handling analogical proportions in classical logic and fuzzy
logics settings. In: Sossai, C., Chemello, G. (eds.) ECSQARU 2009. LNCS (LNAI),
vol. 5590, pp. 638–650. Springer, Heidelberg (2009). https://doi.org/10.1007/978-
3-642-02906-6 55
59. Mitchell, T.: Version spaces: an approach to concept learning. Ph.D. thesis, Stan-
ford (1979)
60. More, T.: On the construction of Venn diagrams. J. Symb. Logic 24(4), 303–304
(1959)
61. Mushthofa, M., Schockaert, S., De Cock, M.: Solving disjunctive fuzzy answer set
programs. In: Calimeri, F., Ianni, G., Truszczynski, M. (eds.) LPNMR 2015. LNCS
(LNAI), vol. 9345, pp. 453–466. Springer, Cham (2015). https://doi.org/10.1007/
978-3-319-23264-5 38
62. Narodytska, N.: Formal analysis of deep binarized neural networks. In: Lang, J.
(ed.) Proceedings 27th International Joint Conference Artificial Intelligence (IJCAI
2018), Stockholm, 13–19 July 2018, pp. 5692–5696 (2018)
63. Newell, A., Simon, H.A.: The logic theory machine. a complex information pro-
cessing system. In: Proceedings IRE Transactions on Information Theory(IT-2),
The Rand Corporation, Santa Monica, Ca, 1956. Report P-868, 15 June 1956, pp.
61-79, September 1956
Towards a Reconciliation Between Reasoning and Learning 167

64. Nilsson, N.J.: The Quest for Artificial Intelligence : A History of Ideas andAchieve-
ments. Cambridge University Press, Cambridge (2010)
65. Nin, J., Laurent, A., Poncelet, P.: Speed up gradual rule mining from stream data!
A B-tree and OWA-based approach. J. Intell. Inf. Syst. 35(3), 447–463 (2010)
66. Pearl, J.: Causality, vol. 2000, 2nd edn. Cambridge University Press, Cambridge
(2009)
67. Perfilieva, I., Dubois, D., Prade, H., Esteva, F., Godo, L., Hodáková, P.: Inter-
polation of fuzzy data: analytical approach and overview. Fuzzy Sets Syst. 192,
134–158 (2012)
68. Pinkas, G.: Propositional non-monotonic reasoning and inconsistency in symmetric
neural networks. In: Mylopoulos, J., Reiter, R. (eds.) Proceedings 12th Interna-
tional Joint Conference on Artificial Intelligence, Sydney, 24–30 August 1991, pp.
525–531. Morgan Kaufmann (1991)
69. Pinkas, G.: Reasoning, nonmonotonicity and learning in connectionist networks
that capture propositional knowledge. Artif. Intell. 77(2), 203–247 (1995)
70. Pinkas, G., Cohen, S.: High-order networks that learn to satisfy logic constraints.
FLAP J. Appl. Logics IfCoLoG J. Logics Appl. 6(4), 653–694 (2019)
71. Prade, H.: Reasoning with data - a new challenge for AI? In: Schockaert, S., Senel-
lart, P. (eds.) SUM 2016. LNCS (LNAI), vol. 9858, pp. 274–288. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-45856-4 19
72. Prade, H., Richard, G.: From analogical proportion to logical proportions. Logica
Universalis 7(4), 441–505 (2013)
73. Prade, H., Rico, A., Serrurier, M.: Elicitation of sugeno integrals: a version space
learning perspective. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds.) ISMIS
2009. LNCS (LNAI), vol. 5722, pp. 392–401. Springer, Heidelberg (2009). https://
doi.org/10.1007/978-3-642-04125-9 42
74. Prade, H., Rico, A., Serrurier, M., Raufaste, E.: Elicitating sugeno integrals:
methodology and a case study. In: Sossai, C., Chemello, G. (eds.) ECSQARU
2009. LNCS (LNAI), vol. 5590, pp. 712–723. Springer, Heidelberg (2009). https://
doi.org/10.1007/978-3-642-02906-6 61
75. Prade, H., Serrurier, M.: Bipolar version space learning. Int. J. Intell. Syst. 23,
1135–1152 (2008)
76. Raufaste, E.: Les Mécanismes Cognitifs du Diagnostic Médical : Optimisation et
Expertise. Presses Universitaires de France (PUF), Paris (2001)
77. Richardson, M., Domingos, P.M.: Markov logic networks. Mach. Learn. 62(1–2),
107–136 (2006)
78. Rocktäschel, T., Riedel, S.: End-to-end differentiable proving. In: Guyon, I., et al.
(eds.) Proceedings 31st Annual Conference on Neural Information Processing Sys-
tems (NIPS 2017), Long Beach, 4–9 December 2017, pp. 3791–3803 (2017)
79. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and
organization in the brain. Psychol. Rev. 65(6), 386–408 (1958)
80. Rückert, U., De Raedt, L.: An experimental evaluation of simplicity in rule learn-
ing. Artif. Intell. 172(1), 19–28 (2008)
81. Samuel, A.: Some studies in machine learning using the game of checkers. IBM J.
3, 210–229 (1959)
82. Schockaert, S., Prade, H.: Interpolation and extrapolation in conceptual spaces: a
case study in the music domain. In: Rudolph, S., Gutierrez, C. (eds.) RR 2011.
LNCS, vol. 6902, pp. 217–231. Springer, Heidelberg (2011). https://doi.org/10.
1007/978-3-642-23580-1 16
168 D. Dubois and H. Prade

83. Schockaert, S., Prade, H.: Interpolative and extrapolative reasoning in proposi-
tional theories using qualitative knowledge about conceptual spaces. Artif. Intell.
202, 86–131 (2013)
84. Selfridge, O.G.: Pandemonium: a paradigm for learning. In: Blake, D.V., Uttley,
A.M. (ed) Symposium on Mechanisation of Thought Processes, London, 24–27
November 1959, vol. 1958, pp. 511–529 (1959)
85. Serafini, L., Garcez, A.S.A.: Logic tensor networks: deep learning and logical rea-
soning from data and knowledge. In: Besold, T.R., Lamb, L.C., Serafini, L., Tabor,
W. (eds.) Proceedings 11th International Workshop on Neural-Symbolic Learning
and Reasoning (NeSy 2016), New York City, 16–17 July 2016, vol. 1768 of CEUR
Workshop Proceedings (2016)
86. Serafini, L., Donadello, I., Garcez, A.S.A.: Learning and reasoning in logic tensor
networks: theory and application to semantic image interpretation. In: Seffah, A.,
Penzenstadler, B., Alves, C., Peng, X. (eds.) Proceedings Symposium on Applied
Computing (SAC 2017), Marrakech, 3–7 April 2017, pp. 125–130. ACM (2017)
87. Serrurier, M., Dubois, D., Prade, H., Sudkamp, T.: Learning fuzzy rules with their
implication operators. Data Knowl. Eng. 60(1), 71–89 (2007)
88. Serrurier, M., Prade, H.: Introducing possibilistic logic in ILP for dealing with
exceptions. Artif. Intell. 171(16–17), 939–950 (2007)
89. Shannon, C.E.: Programming a computer for playing chess. Philos. Mag. (7th
series) XLI (314), 256–275 (1950)
90. Solomonoff, R.J.: An inductive inference machine. Tech. Res. Group, New York
City (1956)
91. Turing, A.M.: Intelligent machinery. Technical report, National Physical Labora-
tory, London, 1948. Also. In: Machine Intelligence, vol. 5, pp. 3–23. Edinburgh
University Press (1969)
92. Turing, A.M.: Computing machinery and intelligence. Mind 59, 433–460 (1950)
93. Ughetto, L., Dubois, D., Prade, H.: Implicative and conjunctive fuzzy rules - a
tool for reasoning from knowledge and examples. In: Hendler, J., Subramanian,
D. (eds.) Proceedings 16th National Confernce on Artificial Intelligence, Orlando,
18–22 July 1999, pp. 214–219 (1999)
94. Walley, P.: Statistical Reasoning with Imprecise Probabilities. Chapman and Hall,
London (1991)
95. Walley, P.: Measures of uncertainty in expert systems. Artif. Intell. 83(1), 1–58
(1996)
96. Wiener, N.: Cybernetics or Control and Communication in the Animal and the
Machine. Wiley, Hoboken (1949)
97. Zadeh, L.A.: Thinking machines - a new field in electrical engineering. Columbia
Eng. Q. 3, 12–13 (1950)
98. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and
decision processes. IEEE Trans. Syst. Man Cybern. 3(1), 28–44 (1973)
CP-Nets, π-pref Nets, and Pareto Dominance

Nic Wilson1(B) , Didier Dubois2 , and Henri Prade2


1 Insight Centre for Data Analytics, School of Computer Science and IT, University College Cork, Cork, Ireland
[email protected]
2 IRIT-CNRS, Université Paul Sabatier, Toulouse, France

Abstract. Two approaches have been proposed for the graphical han-
dling of qualitative conditional preferences between solutions described
in terms of a finite set of features: Conditional Preference networks (CP-
nets for short) and more recently, Possibilistic Preference networks (π-
pref nets for short). The latter agree with Pareto dominance, in the sense
that if a solution violates a subset of preferences violated by another
one, the former solution is preferred to the latter one. Although such an
agreement might be considered as a basic requirement, it was only con-
jectured to hold as well for CP-nets. This non-trivial result is established
in the paper. Moreover it has important consequences for showing that
π-pref nets can at least approximately mimic CP-nets by adding explicit
constraints between symbolic weights encoding the ceteris paribus pref-
erences, in case of Boolean features. We further show that dominance
with respect to the extended π-pref nets is polynomial.

1 Introduction

Ceteris Paribus Conditional Preference Networks (CP-nets, for short) [5,6] were
introduced in order to provide a convenient tool for the elicitation of multidi-
mensional preferences and accordingly compare the relative merits of solutions
to a problem. They are based on three assumptions: only ordinal information is
required; the preference statements deal with the values of single decision vari-
ables in the context of fixed values for other variables that influence them; pref-
erences are provided all else being equal (ceteris paribus). CP-nets were inspired
by Bayesian networks (they use a dependency graph, most of the time a directed
acyclic one, whose vertices are variables) but differ from them by being quali-
tative, by their use of the ceteris paribus assumption, and by the fact that the
variables in a CP-net are decision variables rather than random variables. In
the most common form of CP-nets, each preference statement in the prefer-
ence graph translates into a strict preference between two solutions (i.e., value
assignment to all decision variables) differing on a single variable (referred to as
a worsening flip) and the dominance relation between solutions is the transitive
closure of this worsening flip relation.

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 169–183, 2019.
https://doi.org/10.1007/978-3-030-35514-2_13
170 N. Wilson et al.

Another kind of conditional preference network, called π-pref nets, has been
more recently introduced [1], and is directly inspired by the counterpart of
Bayesian networks in possibility theory, called possibilistic networks [3]. A π-pref
net shares with CP-nets its directed acyclic graphical structure between decision
variables, and conditional preference statements attached to each variable in the
contexts defined by assignments of its parent variables in the graph. The prefer-
ence for one value against another is captured by assigning degrees of possibility
(here interpreted as utilities) to these values. When the only existing prefer-
ences are those expressed by the conditional statements (there are no preference
statements across contexts or variables), it has been shown that the dominance
relation between solutions is obtained by comparing vectors of symbolic utility
values (one per variable) using Pareto-dominance.
Some results comparing the preference relations between solutions obtained
from CP-nets and π-pref nets with Boolean decision variables are given in [1].
This is made easy by the fact that CP-nets and π-pref nets share the graph struc-
ture and the conditional preference tables. It was shown that the two obtained
dominance relations between solutions cannot conflict with each other (there
is no preference reversal between them), and that ceteris paribus information
can be added to π-pref nets in the form of preference statements between spe-
cific products of symbolic weights. One pending question was to show that the
dominance relation between solutions obtained from a CP-net refines the prefer-
ence relation obtained from the corresponding π-pref net. In the case of Boolean
variables, the π-pref net ordering can be viewed as a form of Pareto ordering:
each assignment of a decision variable is either good (i.e., in agreement with
the preference statement) or bad. The pending question comes down to proving
a monotonicity condition for the preference relation on solutions, stating that
as soon as a solution contains more (in the sense of inclusion) good variable
assignments than another solution, it should be strictly preferred by the CP-
net. Strangely enough, this natural question has hardly been addressed in the
literature so far (see [2] for some discussion). The aim of this paper is to solve
this problem, and more generally to compare the orderings of solutions using the
two preference modeling structures.
We further show that dominance with respect to extended π-pref nets can
be computed in polynomial time, using linear programming; it thus forms a
polynomial upper approximation for the CP-net dominance relation.
The paper is structured as follows: In Sect. 2 we define a condition, that we
call local dominance, which is shown to be a sufficient condition for dominance
in a CP-net. The following two sections, Sects. 3 and 4, make use of this sufficient
condition in producing results that show that a form of Pareto ordering is a
lower bound for CP-net dominance. Section 5 then uses the
results of Sect. 4 to show that π-pref net dominance is a lower bound for CP-net
dominance. We also show there that the extended π-pref net dominance, which
is an upper bound for CP-net dominance, can be computed in polynomial time.
Section 6 concludes.
CP-Nets, π-pref Nets, and Pareto Dominance 171

2 A Sufficient Condition for Dominance in a CP-Net


We start by recalling the definition of CP-nets and a characterization of the
corresponding dominance relation between solutions.

2.1 Defining CP-nets


We consider a finite set of variables V. Each variable X ∈ V has an associated finite domain Dom(X). An outcome (also called a solution) is a complete assignment to the variables in V, i.e., a function w such that, for each variable X ∈ V, w(X) ∈ Dom(X).
A CP-net Σ over a set of variables V is a pair ⟨G, P⟩. The first component G is a directed graph with vertices in V, and we say that CP-net Σ is acyclic if G is acyclic. For a variable X ∈ V, let U_X be the set of parents of X in G, i.e., the set of variables Y such that (Y, X) is an edge in G. The second component P of Σ consists of a collection of partial orders {>^X_u : X ∈ V, u ∈ Dom(U_X)}, called conditional preference tables; for each variable X ∈ V and each assignment u to the parents U_X of X, the relation >^X_u is a strict partial order (i.e., a transitive and irreflexive relation) on Dom(X). We make the assumption that for each variable X there exists at least one assignment u to U_X such that >^X_u is non-empty (i.e., for each X ∈ V there exist some x, x' ∈ Dom(X) and some u such that x >^X_u x').
Let w be an outcome and, for a variable X ∈ V, let u = w(U_X) be the projection of w to the parent set of X. If x >^X_u x' then we shall write, for simplicity (with the understanding that x and x' are elements of Dom(X)):
x > x' given w [with respect to Σ].
Note that if v is any outcome whose projection to the parent set of X is also u, then [x > x' given v] if and only if [x > x' given w]; the values of w(Y) and v(Y) may differ for variables Y outside U_X, but the preference between x and x' in the context u does not depend on such Y.
We say that Σ is locally totally ordered if each associated strict partial order >^X_u is a strict total order, so that for each pair of different elements x and x' of Dom(X), we have either x >^X_u x' or x' >^X_u x. We say that Σ is Boolean if, for each X ∈ V, the domain has exactly two elements: |Dom(X)| = 2.¹

The Dominance Relation Associated with a CP-Net. Given a CP-net Σ over variables V, we say that w' is a worsening flip from w w.r.t. Σ if w and w' are outcomes that differ on exactly one variable X ∈ V (so that w'(X) ≠ w(X) and, for all Y ∈ V \ {X}, w'(Y) = w(Y)), and w(X) >^X_u w'(X), where u is the projection of w (or w') to the parent set U_X of X.
The set of direct consequences of CP-net Σ is the set of pairs (w, w'), where w' is a worsening flip from w w.r.t. Σ, forming an irreflexive relation:

¹ If a variable has only one element in its domain, it is a constant, and we could remove it if we wished.

Definition 1. The worsening flip relation >^Σ_wf is defined by w >^Σ_wf w' if and only if w(X) >^X_u w'(X), where u = w(U_X) = w'(U_X). Let the binary relation >^Σ_cp on outcomes denote the transitive closure of >^Σ_wf. If w >^Σ_cp w' we say that w [cp-]dominates w' [with respect to Σ].

The relation >^Σ_wf is well-defined due to the ceteris paribus assumption. A sequence of outcomes w_1, . . . , w_k is said to be a worsening flipping sequence [with respect to CP-net Σ] from w_1 to w_k if, for each i = 1, . . . , k − 1, w_{i+1} is a worsening flip from w_i. Thus, w cp-dominates w' if and only if there is a worsening flipping sequence from w to w'.
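For small nets, the dominance relation just defined can be computed by brute force. The following Python sketch enumerates worsening flips and searches for a flipping sequence; the two-variable Boolean net used here is a hypothetical illustration, not an example from the paper:

```python
# Hypothetical toy Boolean CP-net: A prefers 1 to 0 unconditionally;
# B prefers 1 to 0 when A = 1, and 0 to 1 when A = 0.
parents = {"A": (), "B": ("A",)}
cpt = {  # maps a parent assignment to the order on {0, 1}, best value first
    "A": lambda u: [1, 0],
    "B": lambda u: [1, 0] if u == (1,) else [0, 1],
}

def worsening_flips(w):
    """Outcomes reachable from w by a single worsening flip."""
    for X, pa in parents.items():
        best, worst = cpt[X](tuple(w[P] for P in pa))
        if w[X] == best:          # flipping best -> worst worsens w on X
            w2 = dict(w)
            w2[X] = worst
            yield w2

def cp_dominates(w, v):
    """True iff a worsening flipping sequence leads from w to v."""
    if w == v:
        return False
    frontier, seen = [w], {frozenset(w.items())}
    while frontier:
        for nxt in worsening_flips(frontier.pop()):
            if nxt == v:
                return True
            key = frozenset(nxt.items())
            if key not in seen:
                seen.add(key)
                frontier.append(nxt)
    return False
```

On this net, (A=1, B=1) cp-dominates (A=0, B=0) via a flipping sequence, while (A=0, B=1) admits no worsening flip at all and therefore dominates nothing.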

2.2 Some Simple Conditions for CP-Dominance

For outcomes w and v we define Δ(w, v) to be the set of variables on which they differ, i.e., {X ∈ V : w(X) ≠ v(X)}. The following lemma gives two simple sufficient conditions for w to dominate v with respect to a CP-net. In Case (i), for each variable X in Δ(w, v), there is a worsening flip from w changing w(X) to v(X). In Case (ii), for each such X, there is an improving flip from v changing v(X) to w(X).

Lemma 1. Consider an acyclic CP-net Σ and two different outcomes w and v.


Then w cp-dominates v w.r.t. CP-net Σ if either
(i) for all X ∈ Δ(w, v), w(X) > v(X) given w; or
(ii) for all X ∈ Δ(w, v), w(X) > v(X) given v.

Proof. Let k = |Δ(w, v)|, which is greater than zero because w ≠ v. Let us label the elements of Δ(w, v) as X_1, . . . , X_k in such a way that if i < j then X_i is not an ancestor of X_j with respect to the CP-net directed graph; this is possible because of the acyclicity assumption on Σ. To prove (i), beginning with outcome w, we flip variables of w to v in the order X_1, . . . , X_k, so that we first change w(X_1) to v(X_1), then change w(X_2) to v(X_2), and so on. The choice of variable ordering means that when we flip variable X_i, the assignment to the parents U_{X_i} of X_i is just w(U_{X_i}). It can be seen that this is a sequence of worsening flips from w to v, and thus w cp-dominates v w.r.t. Σ.
Part (ii) is very similar, except that we start with v, and iteratively change X_i from v(X_i) to w(X_i) in the order i = 1, . . . , k. The assumption behind part (ii) implies that we obtain an improving flipping sequence from v to w. □

Lemma 1 can be used to prove a more general form of itself.

Proposition 1. Consider an acyclic CP-net Σ and two different outcomes w


and v. Assume that for each X ∈ Δ(w, v) either w(X) > v(X) given w w.r.t. Σ,
or w(X) > v(X) given v w.r.t. Σ. Then w cp-dominates v w.r.t. Σ.

Proof. Define outcome u by u(X) = v(X) if X is such that w(X) > v(X) given w (so X ∈ Δ(w, v)), and u(X) = w(X) otherwise. Then Δ(w, v) is the disjoint union of Δ(w, u) and Δ(u, v).
For all X ∈ Δ(w, u), w(X) > u(X) given w, because u(X) = v(X) and w(X) > v(X) given w. Lemma 1 implies that w cp-dominates u w.r.t. Σ.
For all X ∈ Δ(u, v), u(X) > v(X) given v, since u(X) = w(X) and w(X) > v(X) given v. Lemma 1 implies that u cp-dominates v w.r.t. Σ. Thus, w cp-dominates v w.r.t. Σ. □

2.3 The Local Dominance Relation

The conditions of Proposition 1 involve what might be called a local dominance condition.

Definition 2. Given an acyclic CP-net Σ, we say that outcome w locally dominates outcome v [w.r.t. CP-net Σ], written w >^Σ_LD v, if for each X ∈ Δ(w, v) either w(X) > v(X) given w w.r.t. Σ, or w(X) > v(X) given v w.r.t. Σ.

Proposition 1 above implies that if w locally dominates v then w cp-dominates v, so that w >^Σ_LD v implies w >^Σ_cp v. In fact, we even have the following result.

Proposition 2. Given an acyclic CP-net Σ, the binary relation >^Σ_cp is the transitive closure of >^Σ_LD.

Proof. Let ≻ be the transitive closure of >^Σ_LD. Since, by Proposition 1, >^Σ_LD is a subset of >^Σ_cp, and the latter is transitive, we have that ≻ is a subset of >^Σ_cp.
Suppose that w' is a worsening flip from w w.r.t. Σ. Then w(X) > w'(X) given w and Δ(w, w') = {X}, which implies that w locally dominates w'. This shows that >^Σ_LD, and thus ≻, contains the worsening flip relation >^Σ_wf induced by Σ. Being transitive, ≻ contains the transitive closure >^Σ_cp of >^Σ_wf. We have therefore shown that >^Σ_cp equals ≻, the transitive closure of >^Σ_LD. □
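The local dominance test of Definition 2 is directly computable, and by Proposition 1 it certifies cp-dominance without searching for a flipping sequence. A minimal Python sketch on a hypothetical two-variable Boolean net (names and tables are invented for illustration):

```python
# Hypothetical toy Boolean CP-net: A prefers 1 unconditionally;
# B prefers 1 when A = 1, and 0 when A = 0.
parents = {"A": (), "B": ("A",)}
cpt = {  # parent assignment -> preference order on {0, 1}, best first
    "A": lambda u: [1, 0],
    "B": lambda u: [1, 0] if u == (1,) else [0, 1],
}

def better_given(X, x, x2, w):
    """True iff x > x2 for variable X in the context w(U_X)."""
    order = cpt[X](tuple(w[P] for P in parents[X]))
    return order.index(x) < order.index(x2)

def locally_dominates(w, v):
    """w >_LD v: on every differing variable, w's value beats v's
    either in w's context or in v's context (Definition 2)."""
    delta = [X for X in parents if w[X] != v[X]]
    return bool(delta) and all(
        better_given(X, w[X], v[X], w) or better_given(X, w[X], v[X], v)
        for X in delta
    )
```

Each pair accepted by `locally_dominates` is guaranteed, by Proposition 1, to be linked by a worsening flipping sequence in the CP-net.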

3 Pareto Ordering for CP-Nets in the General Case

A Pareto ordering between outcomes comes down to saying that w dominates w' if ∀X ∈ V, w(X) is at least as good an assignment as w'(X) (and better for some X). However, it is not so easy to define Pareto dominance between outcomes in a CP-net when variables are not Boolean. It is often impossible to compare w(X) and w'(X) directly as there is generally no relation >^X_u that compares them. To perform this kind of comparison in the general case of a dependency graph, we must in some way map the various preference relations >^X_u on Dom(X) to some common scale, either totally (using a scoring function) or partially on some landmark values (mapping the best choices or the worst choices). We define a somewhat extreme Pareto-like relation, using the latter idea, below. As mentioned in Sect. 1, and discussed in detail in Sect. 4, the more natural form of Pareto dominance applies only for the case of Boolean CP-nets.

3.1 A Variant of Pareto Dominance for CP-nets


We define a relation >^Σ_sp on the set of outcomes, which can be viewed as being based on a strong variant of the Pareto condition (with sp standing for strong Pareto).

Definition 3 (Fully dominating and fully dominated). For outcome w, we say that x is fully dominating in X given w [w.r.t. Σ] if x ∈ Dom(X) and, for all x' ∈ Dom(X) \ {x}, we have x > x' given w w.r.t. Σ.
Similarly, we say that x is fully dominated in X given w [w.r.t. Σ] if x ∈ Dom(X), |Dom(X)| > 1, and for all x' ∈ Dom(X) \ {x} we have x' > x given w w.r.t. Σ.

Thus, if x is fully dominating in X given w then x is not fully dominated in X given w. Also, there can be at most one element x ∈ Dom(X) that is fully dominating in X given w, and at most one that is fully dominated in X given w.
We define the irreflexive relation >^Σ_sp by: for different outcomes w and v, w >^Σ_sp v if and only if, for all X ∈ V, either v(X) is fully dominated in X given v w.r.t. Σ, or w(X) is fully dominating in X given w w.r.t. Σ.
In the case in which the local relations >^X_u are total orders, the definitions can be simplified. Consider any outcome w and value x in Dom(X), and let u be the projection of w to the parent set of X. Let x*_u and x_{u*} be the best and the worst element (respectively) in Dom(X) for the relation >^X_u. Then x is fully dominating in X given w if and only if x = x*_u, and x is fully dominated in X given w if and only if |Dom(X)| > 1 and x = x_{u*}. Another way of defining the >^Σ_sp relation then consists, for each relation >^X_u, of mapping Dom(X) to a three-valued totally ordered scale L = {1, I, 0} with 1 > I > 0, using a kind of qualitative scoring function f^X_u : Dom(X) → L defined by f^X_u(x*_u) = 1, f^X_u(x_{u*}) = 0, and f^X_u(x) = I otherwise. Note that the relation w >^Σ_sp w' expresses a very strong form of Pareto-dominance, since it requires not only that w ≠ w' and f^X_u(w(X)) ≥ f^X_u(w'(X)), but also that either f^X_u(w(X)) = 1 or f^X_u(w'(X)) = 0, ∀X ∈ V.

Proposition 3. Relation >^Σ_sp is transitive, and is contained in >^Σ_LD, i.e., w >^Σ_sp v implies w >^Σ_LD v, and thus >^Σ_sp ⊆ >^Σ_LD ⊆ >^Σ_cp. Furthermore, >^Σ_sp and >^Σ_cp are equal (i.e., are the same relation) if and only if >^Σ_sp and >^Σ_LD are equal.

Proof. We will prove transitivity of >^Σ_sp by showing that if w_1 >^Σ_sp w_2 and w_2 >^Σ_sp w_3 then w_1 >^Σ_sp w_3. Consider any X ∈ V such that w_3(X) is not fully dominated in X given w_3. Since w_2 >^Σ_sp w_3, we have that w_2(X) is fully dominating in X given w_2, and so w_2(X) is not fully dominated in X given w_2. Since w_1 >^Σ_sp w_2, we have that w_1(X) is fully dominating in X given w_1. Thus, for all X ∈ V, if w_3(X) is not fully dominated in X given w_3 then w_1(X) is fully dominating in X given w_1, and hence w_1 >^Σ_sp w_3, proving transitivity.
Now, suppose that w_1 >^Σ_sp w_2, and consider any X ∈ Δ(w_1, w_2). Either (i) w_2(X) is fully dominated in X given w_2, and thus w_1(X) > w_2(X) given w_2; or (ii) w_1(X) is fully dominating in X given w_1, and thus w_1(X) > w_2(X) given w_1; therefore we have w_1 >^Σ_LD w_2.
Clearly, if >^Σ_sp and >^Σ_cp are equal then the inclusions >^Σ_sp ⊆ >^Σ_LD ⊆ >^Σ_cp imply that >^Σ_sp and >^Σ_LD are equal. Conversely, assume that >^Σ_sp and >^Σ_LD are equal. We then have that >^Σ_LD is transitive (since >^Σ_sp is transitive), and thus it is equal to its transitive closure, which equals >^Σ_cp by Proposition 2. □
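The notions of Definition 3 and the resulting >^Σ_sp test can be sketched in a few lines of Python. The net below is a hypothetical illustration with one three-valued variable B, so that a "middle" value, neither fully dominating nor fully dominated, actually occurs:

```python
# Hypothetical net: Dom(A) = {0, 1}, Dom(B) = {0, 1, 2};
# each table is a strict total order on the domain, best value first.
parents = {"A": (), "B": ("A",)}
cpt = {
    "A": lambda u: [1, 0],
    "B": lambda u: [2, 1, 0] if u == (1,) else [0, 1, 2],
}

def order_at(X, w):
    """The preference order on Dom(X) in the context w(U_X)."""
    return cpt[X](tuple(w[P] for P in parents[X]))

def fully_dominating(X, x, w):
    return order_at(X, w)[0] == x            # x beats every other value

def fully_dominated(X, x, w):
    o = order_at(X, w)
    return len(o) > 1 and o[-1] == x         # x is beaten by every other value

def sp_dominates(w, v):
    """w >_sp v: for every variable, v's value is fully dominated in v's
    context, or w's value is fully dominating in w's context."""
    return w != v and all(
        fully_dominated(X, v[X], v) or fully_dominating(X, w[X], w)
        for X in parents
    )
```

With a middle value of B (score I on the scale L), a pair can fail the sp test even though the values compare favourably variable by variable, which is what makes >^Σ_sp so strong.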

3.2 Necessary and Sufficient Conditions for Equality of >^Σ_sp and >^Σ_cp

We will show that >^Σ_sp and >^Σ_cp are only equal under extremely special conditions, including that the CP-net is unconditional and that each domain has at most two elements. We use a series of lemmas to prove the result.
The first lemma follows easily using the transitivity of >^Σ_sp.

Lemma 2. Given CP-net Σ, we have that >^Σ_sp equals >^Σ_cp if and only if, for all pairs (w, w') such that w' is a worsening flip from w, we have w >^Σ_sp w'.

Proof. We need to prove that >^Σ_sp equals >^Σ_cp if and only if >^Σ_sp contains the worsening flip relation >^Σ_wf induced by Σ. Since >^Σ_cp is the transitive closure of >^Σ_wf, if >^Σ_sp equals >^Σ_cp then >^Σ_sp contains >^Σ_wf.
Regarding the converse, assume that >^Σ_sp contains >^Σ_wf. Since, by Proposition 3, >^Σ_sp is transitive, >^Σ_sp contains the transitive closure >^Σ_cp of >^Σ_wf. Proposition 3 implies that >^Σ_sp is a subset of >^Σ_cp, so >^Σ_sp equals >^Σ_cp. □
The definition of >^Σ_sp leads to the following characterisation. Suppose that w' is a worsening flip from w w.r.t. CP-net Σ, with X being the variable on which they differ. Then w >^Σ_sp w' if and only if (a) either w(X) is fully dominating in X given w w.r.t. Σ, or w'(X) is fully dominated in X given w w.r.t. Σ; and (b) for all Y ∈ V \ {X},
(i) if Y is not a child of X then w(Y) is either fully dominated or fully dominating in Y given w w.r.t. Σ; and
(ii) if Y is a child of X then w(Y) is either fully dominating in Y given w w.r.t. Σ or fully dominated in Y given w' w.r.t. Σ.
The above considerations lead to the following result.
Lemma 3. Consider any X ∈ V, any assignment u to the parents of X, and any values x, x' ∈ Dom(X) such that x >^X_u x'. Assume that w >^Σ_sp w' whenever (w, w') is an associated worsening flip, i.e., whenever w(X) = x and w'(X) = x', w and w' agree on all other variables, and w extends u. Let (v, v') be one such associated worsening flip.
If variable Z is not a child of X and z is any element of Dom(Z), then z is either fully dominated or fully dominating in Z given v w.r.t. Σ. We have |Dom(Z)| ≤ 2.
If variable Y is a child of X and y is any element of Dom(Y), then y is either fully dominating given v or fully dominated given v'. We have |Dom(Y)| ≤ 2.

Note that the condition |Dom(Z)| ≤ 2 follows since there can be at most one fully dominated and at most one fully dominating element in Z given v.
Lemma 3 implies that, for any variable X, every other variable has at most two values, which immediately implies that every domain has at most two elements:

Lemma 4. Suppose that >^Σ_sp and >^Σ_cp are equal. Then each domain has at most two values.

Definition 4 (True parents and being unconditional). Let Y be a variable and let X be an element of its parent set U_Y. We say that X is not a true parent of Y if, for all assignments u and u' to U_Y that differ only on the value of X, if y >^Y_u y' then y >^Y_{u'} y'. We say that Y is unconditional in Σ if it has no true parents.

If X is not a true parent of Y then >^Y_u does not depend on X. For any CP-net Σ we can generate an equivalent CP-net (i.e., one that generates the same ordering on outcomes) such that every parent of every variable is a true parent.

Lemma 5. Suppose that >^Σ_sp and >^Σ_cp are equal. Assume that every parent of variable Y is unconditional, and let X be one such parent. Suppose that u is some assignment to the parents of Y, and that u' is another assignment that differs from u only on the value of X. If y >^Y_u y' then y >^Y_{u'} y'.

Proof. Suppose that y >^Y_u y'. Let v be any outcome extending u and let v' be any outcome extending u'. Lemma 4 implies that X has at most two values. If X had only one value then it would trivially not be a true parent of Y, so we can assume that |Dom(X)| = 2. X is unconditional, so it has no parents. Our definition of a CP-net implies that the relation >^X is non-empty, so we have x_1 >^X x_2 for some labelling x_1 and x_2 of the values of X. We first consider the case in which u(X) = x_1. Now, y' is not fully dominating given u and so, by Lemma 3, y' is fully dominated given u', which implies y >^Y_{u'} y'.
We now consider the other case, in which u(X) = x_2. Then y is not fully dominated given u, and so, by Lemma 3, y is fully dominating given u', and thus also y >^Y_{u'} y'. □

Lemma 5 implies that X is not a true parent of Y. Since X was an arbitrary parent of Y, it then implies that Y has no true parent, so is unconditional. Applying this result inductively then implies that every variable in V is unconditional with respect to Σ. Along with Lemmas 3 and 4, this leads to the following result.

Proposition 4. Given a CP-net Σ, we have that >^Σ_sp equals >^Σ_cp if and only if Σ is a Boolean locally totally ordered CP-net such that each variable X is unconditional in Σ.

4 Pareto Ordering for the Boolean Case


As discussed earlier, there is a natural way of defining a Pareto ordering for the case of Boolean locally totally ordered CP-nets. Basically, if a variable is Boolean, each of its values is either fully dominating or fully dominated in each context. So, the relation >^Σ_sp becomes a full-fledged Pareto ordering. In this section we analyse the relationship between this Pareto ordering and the CP-net ordering.
Let Σ be a Boolean locally totally ordered CP-net. Consider any outcome w. We say that variable X is bad for w if there is an improving flip of variable X from w to another outcome w'. Define F_w to be the set of variables which are bad for w.
The definition of F_w and of the local dominance relation (see Sect. 2.3) immediately leads to the following expression of >^Σ_LD in the Boolean locally totally ordered case.

Lemma 6. Let Σ be a Boolean locally totally ordered CP-net. Then, for different outcomes w and v, we have w >^Σ_LD v if and only if F_w ∩ Δ(w, v) ⊆ F_v.

Proof. w >^Σ_LD v if and only if for each X ∈ Δ(w, v) either w(X) > v(X) given w, or w(X) > v(X) given v. For X ∈ Δ(w, v), we have w(X) > v(X) given w if and only if X ∉ F_w; and we have w(X) > v(X) given v if and only if X ∈ F_v. Thus, w >^Σ_LD v if and only if for each X ∈ Δ(w, v) [X ∈ F_w ⇒ X ∈ F_v], which is if and only if F_w ∩ Δ(w, v) ⊆ F_v. □
We define the irreflexive binary relation >^Σ_par on outcomes as follows.

Definition 5. For different outcomes w and v, w >^Σ_par v if and only if F_w ⊆ F_v, i.e., every variable that is bad for w is also bad for v.

This can be viewed as a kind of Pareto ordering, and equals the strong Pareto relation >^Σ_sp (see Sect. 3.1) for the Boolean locally totally ordered case.
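In the Boolean locally totally ordered case, F_w and the relation of Definition 5 take only a few lines to compute. A sketch on a hypothetical two-variable net (names and tables are invented for illustration):

```python
# Hypothetical toy Boolean CP-net: A prefers 1 unconditionally;
# B prefers 1 when A = 1, and 0 when A = 0.
parents = {"A": (), "B": ("A",)}
cpt = {  # parent assignment -> order on {0, 1}, best value first
    "A": lambda u: [1, 0],
    "B": lambda u: [1, 0] if u == (1,) else [0, 1],
}

def bad_set(w):
    """F_w: the variables whose value can be improved by a single flip,
    i.e., those not set to the best value in their context."""
    return {
        X for X in parents
        if cpt[X](tuple(w[P] for P in parents[X]))[0] != w[X]
    }

def pareto_dominates(w, v):
    """w >_par v iff w and v differ and F_w is a subset of F_v."""
    return w != v and bad_set(w) <= bad_set(v)
```

Since each F_w is just a set of variables, the whole test is linear in the number of variables once the contexts are evaluated.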

Lemma 7. Let Σ be a Boolean locally totally ordered CP-net. Let w and v be outcomes. Then w >^Σ_sp v if and only if w >^Σ_par v.

Proof. For different w and v, w >^Σ_sp v if and only if for all X ∈ V either v(X) is fully dominated in X given v, or w(X) is fully dominating in X given w.
Suppose that w >^Σ_sp v and consider any X ∈ V. If X ∈ F_w then w(X) is not fully dominating in X given w, and so v(X) is fully dominated in X given v, which implies that X ∈ F_v. We have shown that F_w ⊆ F_v.
Conversely, assume that F_w ⊆ F_v, and consider any X ∈ V such that v(X) is not fully dominated in X given v. Because Σ is a Boolean locally totally ordered CP-net, this implies that X is not bad for v. Since F_w ⊆ F_v, this implies that X is not bad for w, and so w(X) is fully dominating in X given w. This proves that w >^Σ_sp v. □
The CP-net relation contains the Pareto relation, with the local dominance
relation being between the two.
178 N. Wilson et al.

Theorem 1. Let Σ be a Boolean locally totally ordered CP-net. Relation >Σ par
is transitive, and is contained in >Σ LD, i.e., w >Σ par v implies w >Σ LD v, and
thus >Σ par ⊆ >Σ LD ⊆ >Σ cp. Furthermore, >Σ par and >Σ cp are equal (i.e., are
the same relation) if and only if >Σ par and >Σ LD are equal, which happens only if
every variable of the CP-net is unconditional.
Proof. Theorem 1 follows immediately from Propositions 3 and 4 and Lemma 7. □
As a consequence, we get that CP-nets are in agreement with the Pareto ordering
in the case of Boolean locally totally ordered variables: for any variable X and
any configuration u of its parents, consider the mapping f^X_u : Dom(X) → {0, 1}
such that f^X_u(x) = 1 if x is the preferred value of X given u, and f^X_u(x) = 0
otherwise. For any two distinct outcomes w and w′, we have that
∀X ∈ V, f^X_{w(UX)}(w(X)) ≥ f^X_{w′(UX)}(w′(X)) if and only if Fw ⊆ Fw′,
which is the Pareto ordering >Σ par.
We emphasise the following part of the theorem:

Corollary 1. Let Σ be a Boolean locally totally ordered CP-net. Then w >Σ par w′
implies w >Σ cp w′.
As shown in the previous section, it does not seem straightforward to extend
this Pareto ordering in a natural way to non-Boolean variables without using
scaling functions that map all partial orders (Dom(X), >X u), u ∈ UX, to a common
value scale, unless the variables are all preferentially independent from one
another. In this case, UX = ∅ and >X u = >X for every X ∈ V. We could then
define the Pareto dominance relation >Σ par on outcomes as: w >Σ par w′ if and only
if w ≠ w′ and, for all X ∈ V, w(X) >X w′(X) or w(X) = w′(X).

5 π-pref Nets
Possibility theory [8] is a theory of uncertainty devoted to the representation of
incomplete information. It is maxitive (addition is replaced by maximum), in contrast
with probability theory, and it ranges from purely ordinal to purely numerical
representations. Possibility theory can be used for representing preferences [9]. It
relies on the idea of a possibility distribution π, i.e., a mapping from a universe
of discourse Ω to the unit interval [0, 1]. Possibility degrees π(w) estimate to
what extent the solution w is not unsatisfactory. π-pref nets are based on
possibilistic networks [3], using conditional possibilities of the form
π(x|u) = Π(x ∧ u)/Π(u), for u ∈ Dom(UX), where Π(ϕ) = max_{w ⊨ ϕ} π(w). The use
of product-based conditioning rather than min-based conditioning leads to
possibilistic nets that are more similar to Bayesian nets.
The ceteris paribus assumption of CP-nets is replaced in possibilistic networks
by a chain rule like in Bayesian networks. It enables one to compute,
using an aggregation function, the degree of possibility of solutions. However,
it is supposed that these numerical values are unknown and represented by symbolic
weights. Only orderings between symbolic values, or products thereof, can
be expressed. The dominance relation between solutions is obtained by comparing
products of symbolic utility values computed for them from the conditional
preference tables.

CP-Nets, π-pref Nets, and Pareto Dominance 179

Definition 6 ([1]). A Boolean possibilistic preference network (π-pref net) is
a preference network where |Dom(X)| = 2, ∀X ∈ V, and each preference statement
x >X u x′ is associated with a conditional possibility distribution such that
π(x|u) = 1 > π(x′|u) = α^u_X, where α^u_X is a non-instantiated variable on [0, 1)
that we call a symbolic weight. One may also have indifference statements x ∼X u x′,
expressed by π(x|u) = π(x′|u) = 1.

π-pref nets induce a partial ordering between solutions based on the comparison
of their degrees of possibility, in the sense of a joint possibility distribution
computed using the product-based chain rule: π(x1, . . . , xn) = ∏_{i=1,...,n} π(xi|ui).
The preferences between solutions are of the form w ≻π w′ if and only if
π(w) > π(w′) for all instantiations of the symbolic weights.
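Because the weights are symbolic, this dominance test reduces to a componentwise comparison of the vectors of conditional possibility degrees (as formalised in Sect. 5.1 below). A minimal sketch, with symbolic weights represented as strings and the degree 1 as the number 1; this encoding is ours, not the paper's:

```python
# θ(w) is the vector (π(w(X1)|w(U_X1)), ..., π(w(Xn)|w(U_Xn))); each component
# is 1 (preferred value) or a symbolic weight such as "aB|a" (dispreferred value).
def succ_pi(theta_w, theta_v):
    """w ≻π v: π(w) > π(v) for every instantiation of the weights in [0,1).
    Componentwise, θk(w) ≥ θk(v) requires θk(w) = 1 or θk(w) = θk(v),
    with strictness in at least one component."""
    ge = all(a == 1 or a == b for a, b in zip(theta_w, theta_v))
    strict = any(a == 1 and b != 1 for a, b in zip(theta_w, theta_v))
    return ge and strict

# (1, 1) dominates (1, "aB|a"), but (1, "x") and ("y", 1) are incomparable,
# since distinct symbolic weights cannot be compared across variables:
assert succ_pi((1, 1), (1, "aB|a"))
assert not succ_pi((1, "x"), ("y", 1))
```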

5.1 π-pref Nets vs. CP-Nets

Let us compare the preference relations between solutions induced by CP-nets
and by π-pref nets. It has been shown [2] that the ordering between solutions
induced by a π-pref net corresponds to the Pareto ordering between the vectors
w = (θ1(w), . . . , θn(w)), where θi(w) = π(w(Xi)|w(UXi)), i = 1, . . . , n.
As symbolic weights are not comparable across variables, it is easy to see that
the only way to have π(w) ≥ π(w′) is to have θk(w) ≥ θk(w′) in each component
k of w and w′. Otherwise the products will be incomparable, due to the presence
of distinct symbolic variables on each side. So, if w ≠ w′,

w ≻π w′ if and only if θk(w) ≥ θk(w′), k = 1, . . . , n, and ∃i : θi(w) > θi(w′).

It is then known that the π-pref net ordering between solutions induced by the
preference tables is refined by comparing the sets Fw of bad variables for w:

w ≻π w′ ⇒ Fw ⊂ Fw′

since, if two solutions contain variables having bad assignments in the sense of the
preference tables, the corresponding symbolic values may differ if the contexts
for assigning a value to this variable differ. It has been shown that if the weights
α^u_X, reflecting the satisfaction level due to assigning the bad value to X in the
context u, do not depend on this context, then we have an equivalence in the
above implication:

If ∀X ∈ V, α^u_X = αX, ∀u ∈ Dom(UX), then w ≻π w′ ⟺ w >Σ par w′.

As a consequence, using Corollary 1, it is clear that w ≻π w′ implies w >Σ cp w′, so
that the CP-net preference ordering refines the one induced by the corresponding
Boolean π-pref net. This suggests that we can try to add ceteris paribus constraints
to a π-pref net so as to capture the preferences expressed by a CP-net.
In the following, we highlight local constraints between each node and its
children that enable ceteris paribus to be simulated. Ceteris paribus constraints
are of the form w >Σ cp w′ where w and w′ differ by one flip. For each such
statement (one per variable), we add the constraint π(w) > π(w′) on possibility
degrees. Using the chain rule, this corresponds to comparing products of symbolic
weights. Let Dom(UX) = ×_{Xi ∈ UX} Dom(Xi) denote the Cartesian product of the
domains of the variables in UX, let α^u_X = π(x−|u), where x− is bad for X, and let
γ^{u′}_Y = π(y−|u′). Suppose a CP-net and a π-pref net are built from the same
preference statements. It has been shown in [2] that the worsening flip constraints
are all induced by the conditions: ∀X ∈ V such that X has children Ch(X) ≠ ∅,

max_{u ∈ Dom(UX)} α^u_X < min_{Y ∈ Ch(X), u′ ∈ Dom(UY)} γ^{u′}_Y

Let ≻+π be the resulting preference ordering built from the preference tables by
applying constraints of the above format between symbolic weights. Then it is
clear that ω >Σ cp ω′ ⇒ ω ≻+π ω′: relation ≻+π is a bracketing from above of the
CP-net ordering.

5.2 Relation ≻+π as a Polynomial Upper Bound for CP-Net Dominance

In this section we give a characterisation of the relation ≻+π in terms of deduction
of linear constraints, which implies that determining dominance with respect to
≻+π is polynomial. It is thus a polynomial upper bound for CP-net dominance.
We list all the different symbolic weights (not including 1) as α1, . . . , αm,
and let α represent the whole vector of symbolic weights [α1, . . . , αm].
Let a weights vector z be a vector of m real numbers [z1, . . . , zm] (with
each zi in {−1, 0, 1}). For each such weights vector z, we associate the product
α1^z1 · · · αm^zm, which we abbreviate as Rα[z].
A comparison between products of symbolic weights can be encoded as a
statement Rα[z] > 1. For example, the comparison α1 > α2α3 is equivalent
to Rα[z] > 1 where z = [1, −1, −1, 0, 0, . . .], since Rα[z] = α1^1 α2^{−1} α3^{−1}, and so
Rα[z] > 1 ⟺ α1 α2^{−1} α3^{−1} > 1 ⟺ α1 > α2α3. In this way, every ceteris
paribus statement corresponds to a set of statements Rα[z] > 1 for different
vectors z.
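The exponent-vector encoding above is entirely mechanical. A minimal sketch (our own helper, with 0-based weight indices, not code from the paper):

```python
# Encode a comparison Π_{i in lhs} αi > Π_{i in rhs} αi as the exponent
# vector z, so that the comparison reads Rα[z] = α1^z1 ··· αm^zm > 1.
def encode(lhs, rhs, m):
    z = [0] * m
    for i in lhs:
        z[i] += 1   # weights on the left-hand side get exponent +1
    for i in rhs:
        z[i] -= 1   # weights on the right-hand side get exponent -1
    return z

# α1 > α2·α3 over m = 5 symbolic weights:
assert encode([0], [1, 2], 5) == [1, -1, -1, 0, 0]
# z(4), i.e. α4 < 1, is the comparison 1 > α4:
assert encode([], [3], 5) == [0, 0, 0, -1, 0]
```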
For each i = 1, . . . , m, define the vector z(i) by z(i)_i = −1 and z(i)_j = 0 for all
j ≠ i. Rα[z(i)] > 1 expresses that αi^{−1} > 1, i.e., αi < 1. For a CP-net Σ, let
Z(Σ) be the set of weights vectors associated with the symbolic weight comparisons
for each ceteris paribus statement, plus, for each i = 1, . . . , m, the element z(i).
Similarly, every solution is associated with a product of symbolic weights, so
a comparison w > w′ between solutions w and w′ corresponds to a statement
pertaining to a weights vector z′. The definitions easily lead to the following
characterisation of this form of dominance.
Proposition 5. Consider any CP-net Σ with associated set of weights vectors
Z(Σ), and let w and w′ be two solutions, where the comparison w > w′ has
associated vector z′. We have that w ≻+π w′ if and only if {Rα[z] > 1 : z ∈ Z(Σ)}
implies Rα[z′] > 1, i.e., if one replaces the values of the symbolic weights αi by any
real values such that Rα[z] > 1 holds for each z ∈ Z(Σ), then Rα[z′] > 1 also
holds.

We can write log Rα[z] as z1λ1 + · · · + zmλm = z · λ, where λ is the vector
(λ1, . . . , λm) and each λi = log αi. Thus, Rα[z] > 1 ⟺ log Rα[z] > 0 ⟺
z · λ > 0. By Proposition 5, this implies that w ≻+π w′ if and only if, for all vectors
λ, {z · λ > 0 : z ∈ Z(Σ)} implies z′ · λ > 0.
Using a standard result from convex sets, this leads to the following result,
which gives a somewhat simpler characterisation showing that dominance is
polynomial. It also suggests potential links with Generalized Additive Independent
(GAI) value function approximations of CP-nets [4,7].

Theorem 2. Consider any CP-net Σ with associated set of weights vectors Z(Σ),
and let w and w′ be two different solutions, where w > w′ has associated vector
z′. We have that w ≻+π w′ if and only if there exist non-negative real numbers rz
for each z ∈ Z(Σ) such that ∑_{z ∈ Z(Σ)} rz z = z′. Hence, whether or not w ≻+π w′
holds can be checked in polynomial time.

Proof. As argued above, w ≻+π w′ holds if and only if, for all vectors λ, the set of
inequalities {z · λ > 0 : z ∈ Z(Σ)} implies z′ · λ > 0. We need to show that this
holds if and only if there exist non-negative real numbers rz for each z ∈ Z(Σ)
such that ∑_{z ∈ Z(Σ)} rz z = z′. Firstly, let us assume that there exist non-negative
real numbers rz for each z ∈ Z(Σ) such that ∑_{z ∈ Z(Σ)} rz z = z′. Consider any
vector λ such that z · λ > 0 for all z ∈ Z(Σ). Then z′ · λ = ∑_{z ∈ Z(Σ)} rz (z · λ),
which is greater than zero since each rz is non-negative, and at least some rz > 0
(else z′ is the zero vector, which would contradict w ≠ w′).
Conversely, let us assume that there do not exist non-negative real numbers
rz for each z ∈ Z(Σ) such that ∑_{z ∈ Z(Σ)} rz z = z′. To prove that the set of
inequalities {z · λ > 0 : z ∈ Z(Σ)} does not imply z′ · λ > 0, we will show that
there exists a vector λ with z · λ > 0 for all z ∈ Z(Σ) but z′ · λ ≤ 0. Let C be the
set of vectors of the form ∑_{z ∈ Z(Σ)} rz z over all choices of non-negative reals rz.
Now, C is a convex and closed set, which by the hypothesis does not intersect
with {z′} (i.e., does not contain z′). Since {z′} is closed and compact, we can use
a hyperplane separation theorem to show that there exist a vector λ and real
numbers c1 < c2 such that for all x ∈ C, x · λ > c2, and z′ · λ < c1. Because C
is closed under strictly positive scalar multiplication (i.e., x ∈ C implies rx ∈ C
for all real r > 0), we must have c2 ≤ 0 and x · λ ≥ 0 for all x ∈ C, and in
particular z · λ ≥ 0 for all z ∈ Z(Σ). Also, z′ · λ < c1 < c2 ≤ 0, so z′ · λ ≤ 0, as
required.
The last part follows since linear programming is polynomial. □
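Theorem 2 turns dominance checking into a linear-programming feasibility question: find non-negative rz with ∑ rz z = z′. The sketch below (our own code, not the authors') only verifies a given certificate of coefficients; finding one is the standard LP feasibility task the theorem refers to:

```python
# A certificate for w ≻π+ w' is a vector r of non-negative coefficients with
# Σ_z r_z · z = z'; checking a proposed certificate is a componentwise sum.
def certifies(Z, r, z_prime):
    if any(c < 0 for c in r):
        return False
    total = [sum(c * z[k] for c, z in zip(r, Z)) for k in range(len(z_prime))]
    return total == list(z_prime)

# From α1 > α2 (vector [1,-1,0]) and α2 > α3 (vector [0,1,-1]), the
# combination with r = (1, 1) yields [1, 0, -1], i.e. α1 > α3:
Z = [[1, -1, 0], [0, 1, -1]]
assert certifies(Z, [1, 1], [1, 0, -1])
```

Finding the coefficients, rather than checking them, is solvable in polynomial time by linear programming, as stated in the theorem.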
6 Summary and Discussion


In this paper we have compared CP-nets and π-pref nets, two qualitative
counterparts of Bayes nets for the representation of conditional preferences. We have
studied them from the point of view of their rationality, namely whether they
respect Pareto dominance between multiple Boolean-variable solutions to a decision
problem expressed by such graphical models. While π-pref nets naturally
respect this property, strangely enough, it was previously unknown whether the
preference ordering induced by CP-nets respects it or not. For more general (non-Boolean)
variables, it seems difficult to extend this notion of Pareto dominance
for a CP-net in an entirely natural way. Besides, it was shown previously that the
ordering induced by π-pref nets is weaker than the one induced by CP-nets, but
ceteris paribus constraints can be added to a π-pref net in the form of constraints
between products of symbolic variables. Here we have shown the polynomial nature of
this encoding. Thus we get a bracketing of the CP-net preference ordering by
bounds which are apparently easier to compute than standard CP-net preferences.
Further research includes constructing an example that explicitly proves
that the upper approximation of the CP-net ordering is not tight; moreover, the
case of non-Boolean variables deserves further investigation.

Acknowledgements. This material is based upon work supported by the Science
Foundation Ireland under Grants No. 12/RC/2289 and No. 12/RC/2289-P2, which are
co-funded under the European Regional Development Fund.

References
1. Ben Amor, N., Dubois, D., Gouider, H., Prade, H.: Possibilistic preference networks.
Inf. Sci. 460–461, 401–415 (2018)
2. Ben Amor, N., Dubois, D., Gouider, H., Prade, H.: Expressivity of possibilistic
preference networks with constraints. In: Moral, S., Pivert, O., Sánchez, D., Marı́n,
N. (eds.) SUM 2017. LNCS (LNAI), vol. 10564, pp. 163–177. Springer, Cham (2017).
https://doi.org/10.1007/978-3-319-67582-4 12
3. Benferhat, S., Dubois, D., Garcia, L., Prade, H.: On the transformation between
possibilistic logic bases and possibilistic causal networks. Int. J. Approx. Reasoning
29(2), 135–173 (2002)
4. Boutilier, C., Bacchus, F., Brafman, R.I.: UCP-networks: a directed graphical rep-
resentation of conditional utilities. In: Proceedings of the 17th Conference on Uncer-
tainty in AI, Seattle, Washington, USA, pp. 56–64 (2001)
5. Boutilier, C., Brafman, R.I., Hoos, H.H., Poole, D.: Reasoning with conditional
ceteris paribus preference statements. In: Proceedings of the 15th Conference on
Uncertainty in AI, Stockholm, Sweden, pp. 71–80 (1999)
6. Boutilier, C., Brafman, R.I., Domshlak, C., Hoos, H.H., Poole, D.: CP-nets: a tool for
representing and reasoning with conditional ceteris paribus preference statements.
J. Artif. Intell. Res. 21, 135–191 (2004)
7. Brafman, R.I., Domshlak, C., Kogan, T.: Compact value-function representations
for qualitative preferences. In: Proceedings of the 20th Conference on Uncertainty
in AI, Banff, Canada, pp. 51–59 (2004)
8. Dubois, D., Prade, H.: Possibility Theory: An Approach to Computerized Processing
of Uncertainty. Plenum Press (1988)
9. Dubois, D., Prade, H.: Possibility theory as a basis for preference propagation in
automated reasoning. In: Proceedings of the 1st IEEE International Conference on
Fuzzy Systems, San Diego, CA, pp. 821–832 (1992)
Measuring Inconsistency Through
Subformula Forgetting

Yakoub Salhi(B)

CRIL - CNRS & Université d’Artois, Lens, France


[email protected]

Abstract. In this paper, we introduce a new approach for defining
inconsistency measures. The key idea consists in forgetting subformula
occurrences in order to restore consistency. Thus, our approach can be
seen as a generalization of the approach based on forgetting only
propositional variables. We here introduce rationality postulates of
inconsistency measuring that take into account, in a syntactic way, the
internal structure of the formulas. We also describe different inconsistency
measures that are based on forgetting subformula occurrences.

1 Introduction
In this work, we are interested in quantifying conflicts for better analyzing the
nature of the inconsistency in a knowledge base. Plenty of proposals for
inconsistency measures have been defined in the literature (e.g. see [3,7,9,14,15]),
and it has been shown that they can be applied in different domains, such as
e-commerce protocols [4], integrity constraints [6], databases [13], multi-agent
systems [10], and spatio-temporal qualitative reasoning [5].
In the literature, an inconsistency measure is defined as a function that
associates a non-negative value with each knowledge base. In particular, the authors
in [9] have proposed different rationality postulates for defining inconsistency
measures that allow capturing important aspects related to inconsistency in the
case of classical propositional logic. Furthermore, objections to some of them,
and many new postulates, have also been proposed in [1]. The main advantage of
the approach based on rationality postulates for defining inconsistency measures
is its flexibility, in the sense that the appropriate measure in a given context can
be chosen through the desired properties from the existing postulates.
In [11,12], the authors have proposed a general framework for reasoning under
inconsistency by forgetting propositional variables to restore consistency. Using
the variable forgetting approach of this framework, an inconsistency measure
has been proposed in [2]. The main idea consists in quantifying the amount of
inconsistency as the minimum number of variable occurrences that have to be
forgotten to restore consistency. We here propose a new approach for defining
inconsistency measures that can be seen as a generalization of the previous
approach. Indeed, our main idea consists in measuring the amount of inconsistency
by considering sets of subformula occurrences that we need to forget to restore
consistency. To the best of our knowledge, we here provide the first approach
that takes into account, in a syntactic way, the internal structure of the formulas.

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 184–191, 2019.
https://doi.org/10.1007/978-3-030-35514-2_14
In this work, we propose rationality postulates for measuring inconsistency
that are based on reasoning about subformula occurrences; in particular, a
postulate stating that forgetting any subformula occurrence does not increase
the amount of inconsistency. Finally, we propose several inconsistency measures
that are based on forgetting subformula occurrences. These measures are defined
by considering the number of modified formulas and the size of the forgotten
subformula occurrences needed to restore consistency. For instance, one of the
proposed inconsistency measures quantifies the amount of inconsistency as the
minimum size of the subformula occurrences that have to be forgotten to obtain
consistency. It is worth mentioning that we show that two of the described
inconsistency measures correspond to two measures existing in the literature: that
introduced in [2] based on forgetting variables, and that introduced in [7] based
on consistent subsets.

2 Preliminaries
2.1 Classical Propositional Logic
We here consider that every piece of information is represented using classical
propositional logic. We use Prop to denote the set of propositional variables.
The set of propositional formulas is denoted Form. We use the letters p, q, r, s to
denote the propositional variables, and the Greek letters φ, ψ and χ to denote
the propositional formulas. Moreover, given a syntactic object o, we use P(o) to
denote the set of propositional variables occurring in o. Given a set of variables
S such that P(φ) ⊆ S, we use M od(φ, S) to denote the set of all the models of
φ defined over S.
The size of a formula φ, denoted s(φ), is inductively defined
as follows: s(p) = s(⊥) = s(⊤) = 1; s(¬ψ) = 1 + s(ψ); s(ψ ⊗ χ) = 1 + s(ψ) + s(χ)
for ⊗ = ∧, ∨, →. In other words, the size of a formula is the number
of occurrences of propositional variables, constants and logical connectives
that appear in it.
Similarly, the set of the subformulas of φ, denoted SF(φ), is inductively
defined as follows: SF(p) = {p}; SF(⊥) = {⊥}; SF(⊤) = {⊤}; SF(¬ψ) =
{¬ψ} ∪ SF(ψ); SF(ψ ⊗ χ) = {ψ ⊗ χ} ∪ SF(ψ) ∪ SF(χ) for ⊗ = ∧, ∨, →.
Given a formula φ and ψ ∈ SF (φ), we use O(φ, ψ) to denote the number
of the occurrences of ψ in φ. Moreover, we consider that the occurrences of a
subformula are ordered starting from the left. For example, consider the formula
φ = (p ∧ q) → (¬r ∨ q). Then, SF (φ) = {φ, p ∧ q, ¬r ∨ q, ¬r, p, q, r}. Further,
O(φ, p) = 1 and O(φ, q) = 2. The first occurrence of q is that occurring in the
subformula p ∧ q, while the second is that occurring in the subformula ¬r ∨ q.
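As a minimal illustration of these inductive definitions (the tuple-based formula encoding below is ours, not the paper's), s(φ) and SF(φ) can be computed by a direct recursion over the syntax tree:

```python
# Formulas as nested tuples: ('var', p), ('top',), ('bot',), ('not', φ),
# and (op, φ, ψ) for op in {'and', 'or', 'imp'}.
def size(f):
    """s(φ): number of variable, constant and connective occurrences."""
    if f[0] in ('var', 'top', 'bot'):
        return 1
    if f[0] == 'not':
        return 1 + size(f[1])
    return 1 + size(f[1]) + size(f[2])

def subformulas(f):
    """SF(φ): the set of subformulas of φ."""
    if f[0] in ('var', 'top', 'bot'):
        return {f}
    if f[0] == 'not':
        return {f} | subformulas(f[1])
    return {f} | subformulas(f[1]) | subformulas(f[2])

# The running example φ = (p ∧ q) → (¬r ∨ q): |SF(φ)| = 7.
phi = ('imp', ('and', ('var', 'p'), ('var', 'q')),
              ('or', ('not', ('var', 'r')), ('var', 'q')))
assert size(phi) == 8
assert len(subformulas(phi)) == 7
```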
The polarity of a subformula occurrence within a formula that has a polarity
(positive or negative) is defined as follows:
– φ is a positive (resp. negative) subformula occurrence of the positive (resp.
negative) formula φ;
– if χ is a positive (resp. negative) subformula occurrence of φ, then χ is also
a positive (resp. negative) subformula occurrence of φ ⊗ ψ, ψ ⊗ φ and ψ → φ,
for every formula ψ and for ⊗ = ∧, ∨;
– if χ is a positive (resp. negative) subformula occurrence of φ, then χ is a
negative (resp. positive) subformula occurrence of ¬φ and φ → ψ, for every
formula ψ.

Consider, for instance, the formula p → (p ∨ q) with the negative polarity. Then,
the left-hand p is a positive subformula occurrence and the right-hand occurrence
is negative.
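Polarity can be computed by carrying a sign down the syntax tree, flipping it under negation and in the antecedent of an implication. A minimal sketch, using the same hypothetical tuple encoding of formulas as in the sketch above:

```python
# Walk the formula tree carrying a polarity sign (+1 positive, -1 negative),
# returning (subformula, sign) pairs in left-to-right occurrence order.
def polarities(f, sign):
    out = [(f, sign)]
    if f[0] == 'not':
        out += polarities(f[1], -sign)          # negation flips polarity
    elif f[0] == 'imp':
        out += polarities(f[1], -sign)          # antecedent flips polarity
        out += polarities(f[2], sign)           # consequent keeps it
    elif f[0] in ('and', 'or'):
        out += polarities(f[1], sign) + polarities(f[2], sign)
    return out

# p → (p ∨ q) taken with negative polarity: the left-hand p is positive,
# while the p inside the consequent stays negative.
phi = ('imp', ('var', 'p'), ('or', ('var', 'p'), ('var', 'q')))
assert [s for f, s in polarities(phi, -1) if f == ('var', 'p')] == [1, -1]
```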
A knowledge base is a finite set of propositional formulas. A knowledge base
K is inconsistent if its associated formula ∧_{φ∈K} φ (⊤ if K = ∅) is inconsistent,
written K ⊢ ⊥; otherwise it is consistent, written K ⊬ ⊥. We use KForm to
denote the set of knowledge bases. Moreover, we use SF(K) to denote the set
∪_{φ∈K} SF(φ).
From now on, we consider that the polarity of the formulas occurring in any
knowledge base is negative; the same results can be obtained by symmetrically
considering the positive polarity.
Given a knowledge base K, a subset K′ ⊆ K is said to be a minimal inconsistent
subset (MIS) of K if (i) K′ ⊢ ⊥ and (ii) ∀φ ∈ K′, K′ \ {φ} ⊬ ⊥. Moreover,
K′ is said to be a maximal consistent subset (MCS) of K if (i) K′ ⊬ ⊥ and
(ii) ∀φ ∈ K \ K′, K′ ∪ {φ} ⊢ ⊥. We use MISes(K) and MCSes(K) to denote
respectively the set of all the MISes and the set of all the MCSes of K.
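For small bases, MISes can be enumerated by brute force over subsets and truth assignments. A minimal sketch over the tuple formula encoding used in the sketches above (our own exponential-time code, for illustration only):

```python
from itertools import combinations, product

def eval_f(f, m):
    """Evaluate a tuple-encoded formula under assignment m (dict var -> bool)."""
    op = f[0]
    if op == 'var': return m[f[1]]
    if op == 'top': return True
    if op == 'bot': return False
    if op == 'not': return not eval_f(f[1], m)
    if op == 'and': return eval_f(f[1], m) and eval_f(f[2], m)
    if op == 'or':  return eval_f(f[1], m) or eval_f(f[2], m)
    return (not eval_f(f[1], m)) or eval_f(f[2], m)   # 'imp'

def variables(f):
    if f[0] == 'var':
        return {f[1]}
    return set().union(set(), *(variables(g) for g in f[1:] if isinstance(g, tuple)))

def consistent(K):
    vs = sorted(set().union(set(), *(variables(f) for f in K)))
    return any(all(eval_f(f, dict(zip(vs, bits))) for f in K)
               for bits in product((False, True), repeat=len(vs)))

def mises(K):
    """All minimal inconsistent subsets of K (brute force)."""
    bad = [set(S) for r in range(1, len(K) + 1)
           for S in combinations(K, r) if not consistent(S)]
    return [S for S in bad if all(consistent(S - {f}) for f in S)]

f1 = ('and', ('var', 'p'), ('not', ('var', 'p')))
f2 = ('and', ('var', 'p'), ('var', 'q'))
assert mises([f1, f2]) == [{f1}]
```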

2.2 Substitution

Given a formula φ and two subformula occurrences ψ and χ in φ, we say that
ψ and χ are disjoint if one does not occur in the other.
Given two propositional formulas φ and ψ, χ ∈ SF (φ) and i ∈ 1..O(φ, χ),
we use φ[(χ, i)/ψ] to denote the result of substituting the formula ψ for the
ith occurrence of χ in φ. Further, we use φ[χ/ψ], φ[(χ)+ /ψ] and φ[(χ)− /ψ] to
denote the result of substituting the formula ψ for, respectively, all the
occurrences of χ, all the positive occurrences of χ and all the negative occurrences of
χ in φ. Similarly, given the formulas φ, ψ1, . . . , ψl, χ1, . . . , χl and the expressions
e1, . . . , el such that each ei has one of the forms (χi, j), χi, (χi)+ and (χi)−,
φ[e1 , . . . , el /ψ1 , . . . , ψl ] is the result of simultaneously substituting ψ1 , . . . , ψl for
the subformula occurrences corresponding to the expressions e1 , . . . , el respec-
tively. It is worth mentioning that the subformula occurrences corresponding to
the expressions e1 , . . . , el should be pairwise disjoint in φ.
For instance, consider the formula φ = (p ∧ q) → (p ∨ q) with the negative
polarity. Then, φ[(p)+ , (q, 2)/(p ∧ ¬q), r] corresponds to the formula ((p ∧ ¬q) ∧
q) → (p ∨ r). Indeed, there is a unique positive occurrence of p which is on the
left-hand side of the implication and it is replaced with p ∧ ¬q; and the second
occurrence of q is on the right-hand side of the implication and it is replaced
with r.
3 Inconsistency Measure
In the literature, an inconsistency measure is defined as a function that associates
a non-negative value with each knowledge base (e.g. [3,7,9,14,15]). It is used to
quantify the amount of inconsistency in a knowledge base. The different works
on inconsistency measures use postulate-based approaches to capture important
aspects related to inconsistency. In particular, in the recent work [3], the authors
have proposed the following formal definition of inconsistency measure that we
consider in this work.
Definition 1 (Inconsistency Measure). An inconsistency measure is a
function I : KForm → R+∞ that satisfies the two following properties: (i) ∀K ∈
KForm, I(K) = 0 iff K is consistent (Consistency); and (ii) ∀K, K′ ∈ KForm, if
K ⊆ K′ then I(K) ≤ I(K′) (Monotonicity). The set R+∞ corresponds to the set
of positive real numbers augmented with a greatest element denoted ∞.
The postulate (Consistency) means that an inconsistency measure must
allow distinguishing between consistent and inconsistent knowledge bases, and
(Monotonicity) means that the amount of inconsistency does not decrease by
adding new formulas to a knowledge base. Many other postulates have been
introduced in the literature to characterize particular aspects related to
inconsistency (e.g. see [1,9,15]).
Let us now describe some simple inconsistency measures from the literature:
– IM(K) = |MISes(K)| ([8])
– Idhit(K) = |K| − max{|K′| | K′ ∈ MCSes(K)} ([7])
– Ihs(K) = min{|S| | S ⊆ M and ∀φ ∈ K, ∃B ∈ S s.t. B ⊨ φ} − 1, with
M = ∪_{φ∈K} Mod(φ, P(K)) and min{} = ∞ ([14])
– Iforget(K) = min{n | ∧_{φ∈K} φ[(p1, i1), . . . , (pn, in)/C1, . . . , Cn] ⊬ ⊥, with
p1, . . . , pn ∈ Prop and C1, . . . , Cn ∈ {⊤, ⊥}} ([2])
The measure IM quantifies the amount of inconsistency through minimal
inconsistent subsets: more MISes bring more conflicts; Idhit considers the dual of the
size of the greatest MCSes; Ihs is defined through an explicit use of the Boolean
semantics: the amount of inconsistency is related to the minimum number of
models that satisfy all the formulas in the considered knowledge base; and Iforget
defines the amount of inconsistency as the minimum number of variable occurrences
that we have to forget to restore consistency. It is worth mentioning that we consider
here the reformulation of Iforget proposed in [15].

4 Subformula-Based Rationality Postulates


In this section, we propose rationality postulates for measuring inconsistency
that are based on reasoning about forgetting subformula occurrences. In the same
way as in the case of the inconsistency measure Iforget, we use the constants ⊤
and ⊥ to forget subformula occurrences.
The rationality postulates that we consider are defined as follows, ∀K ∈ KForm
and ∀φ ∈ Form with φ ∉ K:
– (ForgetNegOcc):
1. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is
negative, I(K ∪ {φ[(ψ, i)/⊤]}) ≤ I(K ∪ {φ});
2. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is
negative and φ[(ψ, i)/⊥] ∉ K, I(K ∪ {φ}) ≤ I(K ∪ {φ[(ψ, i)/⊥]}).
– (ForgetPosOcc):
1. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is
positive and φ[(ψ, i)/⊤] ∉ K, I(K ∪ {φ}) ≤ I(K ∪ {φ[(ψ, i)/⊤]});
2. ∀ψ ∈ SF(φ) and ∀i ∈ 1..O(φ, ψ) such that the ith occurrence of ψ in φ is
positive, I(K ∪ {φ[(ψ, i)/⊥]}) ≤ I(K ∪ {φ}).

The first property of (ForgetNegOcc) expresses the fact that a negative
subformula occurrence becomes useless for producing inconsistency once it is replaced
with ⊤. Regarding the second property, it is worth mentioning that the
condition φ[(ψ, i)/⊥] ∉ K is only used to prevent formula deletion. The postulate
(ForgetPosOcc) is simply the counterpart of (ForgetNegOcc) for positive
subformula occurrences.
In a sense, the next proposition shows that the previous postulates can be
seen as restrictions of the postulate (Dominance), introduced in [9], in the case
of consistent formulas. Let us recall that (Dominance) is defined as follows:

– ∀K ∈ KForm and ∀φ, ψ ∈ Form with φ  ⊥ and φ ψ, I(K ∪ {φ}) 


I(K ∪ {ψ}).

Proposition 1. The following two properties are satisfied for every φ ∈ Form with
negative polarity, every ψ ∈ SF(φ) and every i ∈ 1..O(φ, ψ): (i) if the ith occurrence
of ψ in φ is negative, then φ[(ψ, i)/⊥] ⊢ φ and φ ⊢ φ[(ψ, i)/⊤]; (ii) if the ith
occurrence of ψ in φ is positive, then φ[(ψ, i)/⊤] ⊢ φ and φ ⊢ φ[(ψ, i)/⊥].

Proof. We here consider only the case of φ[(ψ, i)/⊥] ⊢ φ when the considered
occurrence is negative and the case of φ ⊢ φ[(ψ, i)/⊥] when the considered
occurrence is positive, the other cases being similar. The proof is by mutual
induction on the value of s(φ). If s(φ) = 1, then φ is a propositional variable
or a constant and, as a consequence, φ[(ψ, i)/⊥] = ⊥ holds in the case where
the ith occurrence of ψ in φ is negative. Thus, we obtain φ[(ψ, i)/⊥] = ⊥ ⊢ φ.
Moreover, there is no positive subformula occurrence in this case. Assume now
that s(φ) > 1. Then φ has one of the following forms: ¬φ′, φ1 ∧ φ2, φ1 ∨ φ2
and φ1 → φ2. Consider first the case φ = ¬φ′; the proof is trivial in the case
ψ = φ. If the ith occurrence of ψ in φ is negative, then it is positive in φ′,
and using the induction hypothesis, φ′ ⊢ φ′[(ψ, i)/⊥] holds. Thus, we obtain
¬φ′[(ψ, i)/⊥] = φ[(ψ, i)/⊥] ⊢ ¬φ′ = φ. The case where the ith occurrence of ψ
in φ is positive is similar. The proof in the remaining cases can be obtained by
simple application of the induction hypothesis, except for the case φ1 → φ2, which
is similar to that of ¬φ′. □

For instance, a direct consequence of Proposition 1 is the fact that Ihs satisfies
(ForgetNegOcc) and (ForgetPosOcc). However, IM does not satisfy these
postulates. Indeed, consider K = {p ∧ ¬p, p ∧ q, p ∧ r}. We clearly have IM(K) = 1,
since there is a single MIS, which is {p ∧ ¬p}, but IM({⊤ ∧ ¬p, p ∧ q, p ∧ r}) = 2,
since there are two MISes, {⊤ ∧ ¬p, p ∧ q} and {⊤ ∧ ¬p, p ∧ r}.
We now introduce a rationality postulate, named (ForgetSubformula), that
is based on forgetting all the occurrences of a subformula. Before that, let us
introduce a notational convention. Given a knowledge base K and a subformula
ψ ∈ SF(φ) with φ ∈ K, K[ψ↓] denotes {φ[(ψ)−, (ψ)+/⊤, ⊥] | φ ∈ K}. In other
words, K[ψ↓] is used to denote that all the occurrences of ψ are forgotten to
restore consistency.
The postulate (ForgetSubformula) is defined as follows: ∀K ∈ KForm and
∀ψ ∈ SF(K), I(K[ψ↓]) ≤ I(K). It is clearly weaker than the previous postulates
and simply expresses that the amount of inconsistency does not increase by
forgetting any subformula. This postulate can be used instead of (ForgetNegOcc)
and (ForgetPosOcc) in the case where no distinction is made between the
occurrences of any subformula.
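The forgetting operation K[ψ↓], substituting ⊤ for negative occurrences and ⊥ for positive ones, is a polarity-aware substitution. A minimal sketch, reusing the hypothetical tuple encoding from the sketches above:

```python
# Forget every occurrence of psi in f: substitute ('top',) for negative
# occurrences and ('bot',) for positive ones, tracking polarity while walking.
def forget(f, psi, sign=-1):           # formulas in a base are taken negative
    if f == psi:
        return ('top',) if sign < 0 else ('bot',)
    if f[0] == 'not':
        return ('not', forget(f[1], psi, -sign))
    if f[0] == 'imp':
        return ('imp', forget(f[1], psi, -sign), forget(f[2], psi, sign))
    if f[0] in ('and', 'or'):
        return (f[0], forget(f[1], psi, sign), forget(f[2], psi, sign))
    return f

# In p ∧ ¬p, the outer p is negative and the p under ¬ is positive, so
# forgetting p yields ⊤ ∧ ¬⊥, which is consistent.
f1 = ('and', ('var', 'p'), ('not', ('var', 'p')))
assert forget(f1, ('var', 'p')) == ('and', ('top',), ('not', ('bot',)))
```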

5 Forgetting Based Inconsistency Measures


In this section, we define several inconsistency measures that are based on
forgetting subformula occurrences. We show in particular that two of these measures
correspond to Idhit and Iforget described previously.
The first inconsistency measure, denoted I#_osf, is defined as the minimum
number of subformula occurrences that have to be forgotten to restore
consistency. It is formally defined as follows:

I#_osf({φ1, . . . , φn}) = min{ l1 + · · · + ln | {φ1[(ψ^1_1, j^1_1), . . . , (ψ^1_{l1}, j^1_{l1})/C^1_1, . . . , C^1_{l1}]}
∪ · · · ∪ {φn[(ψ^n_1, j^n_1), . . . , (ψ^n_{ln}, j^n_{ln})/C^n_1, . . . , C^n_{ln}]} ⊬ ⊥, with C^i_k ∈ {⊤, ⊥}}.
The second inconsistency measure, denoted Is_osf, is defined in the same way
as I#_osf, but it takes into account the sizes of the forgotten subformula
occurrences:

Is_osf({φ1, . . . , φn}) = min{ ∑_{i=1..n} ∑_{k=1..li} s(ψ^i_k) | {φ1[(ψ^1_1, j^1_1), . . . , (ψ^1_{l1}, j^1_{l1})/C^1_1, . . . , C^1_{l1}]}
∪ · · · ∪ {φn[(ψ^n_1, j^n_1), . . . , (ψ^n_{ln}, j^n_{ln})/C^n_1, . . . , C^n_{ln}]} ⊬ ⊥, with C^i_k ∈ {⊤, ⊥}}.

The measure Is_osf relates the effort needed to restore consistency to
the size of the considered subformula occurrences, instead of their number as in
I#_osf.
Iosf .
The third inconsistency measure, denoted I^{s,1}_osf , also takes into account the sizes of the forgotten subformula occurrences, with the additional requirement that there is at most one forgotten occurrence in every formula in the knowledge base:

    I^{s,1}_osf ({φ1 , . . . , φn }) = min{ Σ_{i=1}^{n} s(ψi ) | {φ1 [(ψ1 , j1 )/C1 ]} ∪ · · · ∪ {φn [(ψn , jn )/Cn ]} ⊬ ⊥ with C1 , . . . , Cn ∈ {⊤, ⊥} }.

The measure I^{s,1}_osf captures the fact that if we need to forget two disjoint subformula occurrences ψ and ψ′ in the same formula φ to restore consistency, then we have to forget the smallest subformula occurrence in φ containing both ψ and ψ′. This measure allows considering the relationship between occurrences forgotten in the same piece of information.
190 Y. Salhi

For the sake of illustration, consider the base K = {p ∧ q, ¬p ∧ ¬q}. Then, we clearly have I^s_osf (K) = 2, since we only need to forget the first occurrences of p and q to restore consistency. However, I^{s,1}_osf (K) = 3, since we need to forget the entire formula p ∧ q to forget its subformulas p and q. Compared to I^s_osf in this case, we also take into account in I^{s,1}_osf the fact that the first occurrences of p and q are related by conjunction.
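These measures can be computed by brute force on small knowledge bases. The sketch below, which assumes a hypothetical encoding of formulas as nested tuples, enumerates sets of subformula occurrences to forget and returns I^#_osf. On the base K above, forgetting the single occurrence of p ∧ q (as a subformula of itself) already restores consistency, so I^#_osf (K) = 1, in line with I^#_osf = Idhit.

```python
from itertools import combinations, product

# Formulas are encoded as nested tuples (a hypothetical encoding chosen for
# this sketch): atoms are strings, compound formulas are ('not', f),
# ('and', f, g) or ('or', f, g); 'T' and 'F' stand for the constants ⊤ and ⊥.

def atoms(f):
    if f in ('T', 'F'):
        return set()
    if isinstance(f, str):
        return {f}
    return set().union(*(atoms(g) for g in f[1:]))

def holds(f, val):
    if f == 'T': return True
    if f == 'F': return False
    if isinstance(f, str): return val[f]
    if f[0] == 'not': return not holds(f[1], val)
    if f[0] == 'and': return holds(f[1], val) and holds(f[2], val)
    return holds(f[1], val) or holds(f[2], val)  # 'or'

def consistent(base):
    vs = sorted(set().union(set(), *(atoms(f) for f in base)))
    return any(all(holds(f, dict(zip(vs, bits))) for f in base)
               for bits in product([False, True], repeat=len(vs)))

def occurrences(f, path=()):
    """All subformula occurrences of f, identified by their paths."""
    yield path
    if isinstance(f, tuple):
        for k in range(1, len(f)):
            yield from occurrences(f[k], path + (k,))

def forget(f, path, c):
    """Replace the occurrence at `path` by the constant c ('T' or 'F')."""
    if not path:
        return c
    k = path[0]
    return f[:k] + (forget(f[k], path[1:], c),) + f[k + 1:]

def i_osf_count(base):
    """I#_osf: fewest subformula occurrences to forget to regain consistency."""
    occs = [(i, p) for i, f in enumerate(base) for p in occurrences(f)]
    for m in range(len(occs) + 1):
        for chosen in combinations(occs, m):
            for consts in product('TF', repeat=m):
                new = list(base)
                try:
                    for (i, p), c in zip(chosen, consts):
                        new[i] = forget(new[i], p, c)
                except (IndexError, TypeError):
                    continue  # overlapping paths; skip this combination
                if consistent(new):
                    return m

K = (('and', 'p', 'q'), ('and', ('not', 'p'), ('not', 'q')))
assert i_osf_count(K) == 1
```

A minimal solution never needs to forget both an occurrence and one of its own subformulas, so trying the occurrences in top-down order (and discarding clashing combinations) does not affect the minimum.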
The following three inconsistency measures can be seen as variants of the previous ones, obtained by considering subformulas instead of subformula occurrences. These measures can be used in contexts where no distinction is made between the occurrences of a subformula with regard to the amount of inconsistency. For instance, the inconsistency measure denoted I^#_sf is defined as the minimum number of subformulas that have to be forgotten to restore consistency. Thus, forgetting any subformula once or more does not change the amount of inconsistency.
    I^#_sf (K) = min{ m ∈ N | K[ψ1 ↓] · · · [ψm ↓] ⊬ ⊥ }

    I^s_sf (K) = min{ Σ_{i=1}^{m} s(ψi ) | K[ψ1 ↓] · · · [ψm ↓] ⊬ ⊥ with s(ψ1 ) = . . . = s(ψm ) = 1 }

    I^{s,1}_sf ({φ1 , . . . , φn }) = min{ Σ_{χ ∈ ∪_{i=1}^{n} {ψi }} s(χ) | {φ1 [(ψ1 , j1 )/C1 ]} ∪ · · · ∪ {φn [(ψn , jn )/Cn ]} ⊬ ⊥ with C1 , . . . , Cn ∈ {⊤, ⊥} }
One can easily see that all the previous measures satisfy the two postulates (Consistency) and (Monotonicity); as a consequence, they are inconsistency measures with respect to Definition 1. Further, from their definitions, it is clear that they also satisfy the rationality postulates (ForgetNegOcc) and (ForgetPosOcc).
The following proposition states that I^#_osf and Idhit are the same, and that, in addition, I^s_osf (K) = Iforget (K) for every constant-free knowledge base K.
Proposition 2. The following properties are satisfied:

1. I^#_osf (K) = Idhit (K);
2. I^s_osf ({φ1 , . . . , φn }) = min{ Σ_{i=1}^{n} li | {φ1 [(ψ^1_1 , j^1_1 ), . . . , (ψ^1_{l1} , j^1_{l1} ) / C^1_1 , . . . , C^1_{l1} ]} ∪ · · · ∪ {φn [(ψ^n_1 , j^n_1 ), . . . , (ψ^n_{ln} , j^n_{ln} ) / C^n_1 , . . . , C^n_{ln} ]} ⊬ ⊥ with C^1_1 , . . . , C^n_{ln} ∈ {⊤, ⊥} and s(ψ^1_1 ) = . . . = s(ψ^n_{ln} ) = 1 }.

6 Conclusion and Perspectives

We have proposed an approach for measuring inconsistency that takes into account, in a syntactic way, the internal structure of the formulas; it is based on forgetting subformula occurrences to restore consistency. As future work, we intend to investigate further rationality postulates that consider the internal structure in a syntactic way. The aim of such postulates is to capture other interesting links between inconsistency and the notion of subformula occurrence. We also plan to propose inconsistency measures that combine the subformula-forgetting-based approach with other syntactic approaches, such as those based on minimal inconsistent subsets.
Measuring Inconsistency Through Subformula Forgetting 191

References
1. Besnard, P.: Revisiting postulates for inconsistency measures. In: Fermé, E., Leite,
J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 383–396. Springer, Cham
(2014). https://doi.org/10.1007/978-3-319-11558-0 27
2. Besnard, P.: Forgetting-based inconsistency measure. In: Schockaert, S., Senellart,
P. (eds.) SUM 2016. LNCS (LNAI), vol. 9858, pp. 331–337. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-45856-4 23
3. Bona, G.D., Grant, J., Hunter, A., Konieczny, S.: Towards a unified framework
for syntactic inconsistency measures. In: Proceedings of the Thirty-Second AAAI
Conference on Artificial Intelligence, New Orleans, Louisiana, USA (2018)
4. Chen, Q., Zhang, C., Zhang, S.: A verification model for electronic transaction
protocols. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS,
vol. 3007, pp. 824–833. Springer, Heidelberg (2004). https://doi.org/10.1007/978-
3-540-24655-8 90
5. Condotta, J., Raddaoui, B., Salhi, Y.: Quantifying conflicts for spatial and tempo-
ral information. In: Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the Fifteenth International Conference, KR 2016, Cape Town, South
Africa, 25–29 April 2016, pp. 443–452 (2016)
6. Grant, J., Hunter, A.: Measuring inconsistency in knowledgebases. J. Intell. Inf.
Syst. 27(2), 159–184 (2006)
7. Grant, J., Hunter, A.: Distance-based measures of inconsistency. In: van der Gaag,
L.C. (ed.) ECSQARU 2013. LNCS (LNAI), vol. 7958, pp. 230–241. Springer, Hei-
delberg (2013). https://doi.org/10.1007/978-3-642-39091-3 20
8. Hunter, A., Konieczny, S.: Measuring inconsistency through minimal inconsistent
sets. In: Principles of Knowledge Representation and Reasoning: Proceedings of the
Eleventh International Conference, KR 2008, Sydney, Australia, 16–19 September
2008, pp. 358–366. AAAI Press (2008)
9. Hunter, A., Konieczny, S.: On the measure of conflicts: shapley inconsistency val-
ues. Artif. Intell. 174(14), 1007–1026 (2010)
10. Hunter, A., Parsons, S., Wooldridge, M.: Measuring inconsistency in multi-agent
systems. Kunstliche Intelligenz 28, 169–178 (2014)
11. Lang, J., Marquis, P.: Resolving inconsistencies by variable forgetting. In: Proceed-
ings of the Eights International Conference on Principles and Knowledge Represen-
tation and Reasoning (KR-02), Toulouse, France, 22–25 April 2002, pp. 239–250
(2002)
12. Lang, J., Marquis, P.: Reasoning under inconsistency: a forgetting-based approach.
Artif. Intell. 174(12–13), 799–823 (2010)
13. Martinez, M.V., Pugliese, A., Simari, G.I., Subrahmanian, V.S., Prade, H.: How
dirty is your relational database? An axiomatic approach. In: Mellouli, K. (ed.)
ECSQARU 2007. LNCS (LNAI), vol. 4724, pp. 103–114. Springer, Heidelberg
(2007). https://doi.org/10.1007/978-3-540-75256-1 12
14. Thimm, M.: On the expressivity of inconsistency measures. Artif. Intell. 234, 120–
151 (2016)
15. Thimm, M.: On the evaluation of inconsistency measures. In: Grant, J., Martinez,
M.V. (eds.) Measuring Inconsistency in Information, Volume 73 of Studies in Logic.
College Publications, February 2018
Explaining Hierarchical Multi-linear Models

Christophe Labreuche

Thales Research & Technology, Palaiseau, France
[email protected]

Abstract. We are interested in the explanation of the solution to a hierarchical multi-criteria decision aiding problem. We extend a previous approach in which the explanation amounts to identifying the most influential criteria in a decision. This is based on an influence index which extends the Shapley value to trees. The contribution of this paper is twofold. First, we show that, for the multi-linear model, the computation of the influence grows linearly, and not exponentially, with the depth of the tree. Second, we are interested in the case where the values of the alternatives on the criteria are imprecise. The influence indices thus become imprecise as well. An efficient computation approach is proposed for the multi-linear model.

1 Introduction
One of the major challenges of Artificial Intelligence (AI) methods is to explain their
predictions and make them transparent for the user. The explanations can take very
different forms depending on the area. For instance, in Computer Vision, one is inter-
ested in identifying the salient factors explaining the classification of an image [12]. In
Machine Learning, one might look for the smallest modification to make on an instance
to change its class (counter-factual example) [16]. In Constraint Programming, the aim
is to find the simplest way to repair a set of inconsistent constraints [8]. And so on. There
is thus a variety of explanation methods applicable to a wide range of AI methods.
Many decision problems involve multiple attributes to be taken into account. Multi-
Criteria Decision Aiding (MCDA) aims at representing the preferences of a decision
maker regarding options on the basis of multiple and conflicting criteria. In real appli-
cations, one shall use elaborate decision models able to capture complex expertise. A
few models have been shown to have this ability, such as the Choquet integral [3], the
multi-linear model [11] or the Generalized Additive Independence (GAI) model [1, 6].
The main asset of these models is their ability to represent interacting criteria. The
multi-linear model is especially important as it is the most natural multi-dimensional
interpolation model. It is very smooth and does not exhibit the gradient discontinuities that the Choquet integral has. The following example illustrates applications in which
such models are important.

Example 1 (Example 1 in [9]). The DM is a Tactical Operator of a Maritime Patrol aircraft. The task consists in monitoring a maritime area and, in particular, looking for illegal activity. The DM is helped by an automated system that evaluates in real
for illegal activity. The DM is helped by an automated system that evaluates in real
time a Priority Level (PL) associated to each ship in this area. The higher the PL the

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 192–206, 2019.
https://doi.org/10.1007/978-3-030-35514-2_15

[Fig. 1. Hierarchy of criteria for Example 2: the root node 10 has children 1, 7 and 9; node 7 has children 2 and 3; node 9 has children 8 and 6; node 8 has children 4 and 5.]

more suspicious the ship and the more urgent it is to intercept it. The PL is used to raise
the attention of the DM on some specific ships. The computation of the PL depends
on several criteria: 1. Incoherence between Automatic Identification System (AIS) data
and radar detection; 2. Suspicion of drug smuggling on the ship; 3. Suspicion of human
smuggling on the ship; 4. Current speed (since fast boats are often used to avoid being
easily intercepted); 5. Maximum speed since the first detection of the ship (it represents
the urgency of the potential interception); 6. Proximity of the ship to the shore (since
smuggling ships often aim at reaching the shore as fast as possible). □

In the previous example, as in most real applications, the criteria are not considered
in a flat way but are organized as a tree. The criteria are indeed organized hierarchically
with several nested aggregation functions. The hierarchical structure shall represent the
natural decomposition of the decision reasoning into points of view and sub-points of
view. In the previous example, the six criteria are organized as in Fig. 1. The tree of
the DM contains four aggregation nodes: 7. Suspicion of illegal activity; 8. Kinematics;
9. Capability to escape interception; 10. Overall PL.
The ability to explain the evaluation is very important in Example 2. If the PL of a ship suddenly increases over time, the tactical operator needs to understand where this comes from. The latter is under stress and time pressure, and is thus looking for an explanation highlighting the most influential attributes in the evolution of the PL. This type of explanation has recently been widely studied under the name of feature attribution. The aim is to attribute to each feature its level of contribution. Among the many concepts that have been proposed, the Shapley value has been widely used in Machine Learning [4, 10].
The Shapley value has also recently been used as an explanation means in MCDA [9].
In this reference, a new explanation approach for hierarchical MCDA models has been
introduced. The idea is to highlight the criteria that contribute most to the decision.
In Example 2, consider two ships represented by two alternatives x and y taking the
following values on the six attributes x = (x1 , x2 , x3 , x4 , x5 , x6 ) = (+, −, −, −, −, −)
and y = (+, +, +, +, +, +) (where values ‘+’ and ‘−’ indicate a high and low value
respectively). The type of explanation that is sought can typically be that the nodes
contributing the most to the preference of y over x are nodes 8 (Kinematics) and 9
(Capability to escape interception) and not 2 (Suspicion of drug smuggling on the ship)
or 3 (Suspicion of human smuggling on the ship). This helps the user to further analyze
the values of criteria 8 and 9 (and not criteria 2 or 3). To this end, an indicator measuring

the degree to which a node contributes to the preference between two alternatives has
been defined in Ref. [9]. It is a generalization of the Shapley value on trees.
The contribution of this paper is to further develop this approach in two directions.
We are interested in the practical computation of the influence indicator. The main
drawback of the Shapley value is that it has an exponential complexity in the num-
ber of nodes. It has been shown in Ref. [9] that the influence index for a node can be
equivalently be computed on a subtree. The first contribution of this paper is to rewrite
the influence index so as to improve the computational complexity. It cannot be fur-
ther reduced without making assumptions on the utility model. An illustration of the
influence indicator to the Choquet integral has been proposed in Ref. [9]. We consider
in this paper another important class of aggregation model, based on the multi-linear
extension. One of the main results of this paper shows that, for the multi-linear model,
the computations can be performed independently on each aggregation node, making
the computation of the influence index much more tractable (see Sect. 5.2).
In practice, the values of the alternatives on the attributes are imprecise (second
direction of this work). In Example 2, one needs to assess the PL of faraway ships for
which the values of some attributes are not precisely known. In particular, the attributes
related to the intent of the ship cannot readily be determined. Other attributes such as
the heading of a ship cannot be assigned a precise value as it is a fluctuating variable. The imprecision of the values of the attributes can also come from some disagreement among experts' opinions (for attributes corresponding to a subjective judgment).
For numerical attributes, the imprecise value can take the form of an interval. So far,
there is no explanation approach able to capture imprecise values of the alternatives. In
Example 2, the values of a ship on numerical attributes such as the maximum speed or
the proximity of the shore might be given as an interval of confidence. The imprecisions
on the value of the alternatives on the attributes propagate to the influence degrees in
a very complex manner. We show that when the aggregation models are multi-linear
models, the computation of the bounds on the influence degree can be easily obtained
(see Sect. 4).

2 Preference Model and Notations


2.1 MCDA Model
We are given a set of criteria N = {1, . . . , n}, each criterion i ∈ N being associated
with an attribute Xi , either discrete or continuous. The alternatives are characterized by
a value on each attribute and are thus associated to an element in X = X1 × · · · × Xn .
We assume that the preferences of the DM over the alternatives are represented by a
utility model U : X → R.
The hierarchy of criteria is represented by a rooted tree T , defined by the set of nodes MT (i.e. the set of criteria and aggregation nodes) and the children ChT (l) of each node l (i.e. the nodes that are aggregated at node l) [5]. We also denote by NT ⊆ MT the set of leaves of tree T (i.e. the criteria), by sT ∈ MT the root of tree T (i.e. the top aggregation node), by DescT (l) the set of descendants of l, and by LeafT (l) the leaves at or below l ∈ MT . A hierarchical model on criteria N is such that NT = N .

The preference model is composed of an aggregation function Hl at each node l ∈ MT \ NT and a partial utility function ui for each criterion i ∈ NT . For x ∈ X, we can compute U (x) recursively from a function viU defined at each node i ∈ MT :
– viU (x) = ui (xi ) for every leaf i ∈ NT ,
– vlU (x) = Hl ((vkU (x))k∈ChT (l) ) for every aggregation node l ∈ MT \ NT ,
– U (x) = vsT U (x) is the overall utility.

Example 2 (Example 1 cont.). We have
viU (x) = ui (xi ) for i ∈ {1, 2, 3, 4, 5, 6},
v7U (x) = H7 (v2U (x), v3U (x)), v8U (x) = H8 (v4U (x), v5U (x)),
v9U (x) = H9 (v6U (x), v8U (x)), U (x) = v10U (x) = H10 (v1U (x), v7U (x), v9U (x)). □
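The recursive computation of viU can be sketched in a few lines. The tree below is that of Fig. 1, while the partial utilities and aggregation functions are illustrative placeholders (identities and plain averages), not the elicited model of the paper.

```python
# A minimal sketch of the recursive evaluation of a hierarchical model.
# CHILDREN encodes the tree of Fig. 1; u and H are hypothetical placeholders.

CHILDREN = {10: [1, 7, 9], 7: [2, 3], 9: [6, 8], 8: [4, 5]}  # aggregation nodes
LEAVES = [1, 2, 3, 4, 5, 6]

def evaluate(node, x, u, H):
    """v_i^U(x): utility of `node` for alternative x (dict leaf -> value)."""
    if node in LEAVES:
        return u[node](x[node])
    return H[node]([evaluate(k, x, u, H) for k in CHILDREN[node]])

u = {i: (lambda t: t) for i in LEAVES}                   # identity utilities
H = {l: (lambda a: sum(a) / len(a)) for l in CHILDREN}   # placeholder averages

x = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}
U = evaluate(10, x, u, H)        # v7 = v8 = v9 = 0, so U = (1 + 0 + 0) / 3
assert abs(U - 1/3) < 1e-9
```

Any elicited aggregation functions (e.g. multi-linear ones) can be substituted for the averages without changing the recursion.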

2.2 Shapley Value

In Cooperative Game Theory, a game on N is a set function v : 2^N → R such that v(∅) = 0, where N is the set of players and v(S) (for S ⊆ N ) is the amount of wealth produced by S when its members cooperate. It is a non-normalized capacity. The Shapley value is a fair share of the global wealth v(N ) produced by all players together, among themselves [14]:

    φiSh (N, v) := Σ_{S ⊆ N\{i}} [ (n − |S| − 1)! |S|! / n! ] ( v(S ∪ {i}) − v(S) ).    (1)

It can also be written as an average over the permutations on N :

    φiSh (N, v) := (1/n!) Σ_{π ∈ Π(N)} ( v(Sπ (i)) − v(Sπ (i) \ {i}) ),    (2)

where Sπ (π(k)) := {π(1), . . . , π(k)} and Π(N ) is the set of permutations on N .
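Formulas (1) and (2) can be cross-checked numerically. The sketch below is a naive brute-force implementation; the game v(S) = |S|² is an arbitrary illustration, not taken from the paper.

```python
from itertools import permutations, combinations
from math import factorial

def shapley_subsets(n, v):
    """Formula (1): weighted sum over coalitions S subset of N \\ {i}."""
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(n - len(S) - 1) * factorial(len(S)) / factorial(n)
                total += w * (v(set(S) | {i}) - v(set(S)))
        phi.append(total)
    return phi

def shapley_perms(n, v):
    """Formula (2): average marginal contribution over all orderings."""
    phi = [0.0] * n
    for pi in permutations(range(n)):
        seen = set()
        for i in pi:                       # marginal contribution of i in pi
            phi[i] += v(seen | {i}) - v(seen)
            seen.add(i)
    return [p / factorial(n) for p in phi]

v = lambda S: len(S) ** 2                  # a symmetric, super-additive game
a, b = shapley_subsets(3, v), shapley_perms(3, v)
assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))   # (1) and (2) agree
assert abs(sum(a) - v({0, 1, 2})) < 1e-9   # efficiency: shares sum to v(N)
```

For this symmetric game, each of the three players receives v(N)/3 = 3.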

2.3 Influence Index

Consider two alternatives x and y in X. One wishes to explain the reasons of the difference of preference between x and y. The explanation proposed in Ref. [9] takes the form of an index measuring the degree to which each node in MT contributes to the difference of preference between x and y. An influence index denoted by Ii (x, y; U, T ) is computed for each node i ∈ MT for utility model U on the hierarchy T of criteria. The influence index is some kind of Shapley value applied to the game v(S) = U (yS , xN\S ) for all S ⊆ N , where (yS , xN\S ) denotes an alternative taking the values of y in S and the values of x in N \ S. As for the Shapley value, it is defined from permutations on N . Its expression is defined by [9]:

    Ii (x, y, T, U ) = (1/|Π(T )|) Σ_{π ∈ Π(T)} δπx,y,T,U (i)   if i ∈ NT ,
    Ii (x, y, T, U ) = Σ_{k ∈ LeafT (i)} Ik (x, y, T, U )   otherwise,    (3)

where δπx,y,T,U (i) := U (ySπ (i) , xN\Sπ (i) ) − U (ySπ (i)\{i} , x(N\Sπ (i))∪{i} ). In (3), the set of admissible orderings Π(T ) is defined as the set of orderings of the elements of N for which all elements of a subtree of T are consecutive. More precisely, π ∈ Π(T ) iff, for every l ∈ MT \ N , the indices π −1 (LeafT (l)) are consecutive.

2.4 Influence Index of the Restricted Tree

The complexity of computing Ii is equal to |Π(T )|, which is far too large. It has been shown in Ref. [9] that one can reduce this complexity by taking profit of some symmetries among the permutations in Π(T ). The symmetries can be seen by considering subtrees of T . We consider a subtree T′ of T having the same root as T , taking a subset of the nodes of T and having the same edges as T between the nodes that are kept.
Definition of UT′ : Given ((ui )i∈NT , (Hi )i∈MT\NT ) and a subtree T′ of T , we can define ((u′i )i∈NT′ , (H′i )i∈MT′\NT′ ) by u′i = ui for i ∈ NT′ ∩ NT , u′i (xi ) = xi for i ∈ NT′ \ NT and H′i = Hi for i ∈ MT′ \ NT′ . The overall utility on the subtree is denoted by UT′ . We set Xi = R for every i ∈ MT \ NT . Then for x ∈ X, U (x) = UT′ (xT′ ) where xT′ ∈ XT′ is defined by xT′i = xi if i ∈ NT′ ∩ NT and xT′i = viU (x) otherwise.
Definition of T[j] : A particular subtree is obtained when a node j ∈ MT of T becomes a leaf, so that all the descendants of j are encapsulated and represented by j. We define the restricted tree T[j] by MT[j] := (MT \ DescT (j)) ∪ {j}, NT[j] := (NT \ LeafT (j)) ∪ {j}, sT[j] := sT , and ChT[j] (l) = ChT (l) for all l ∈ MT[j] \ NT[j] .
Definition of T[J] : For J = {j1 , . . . , jp }, we set T[J] := (· · · ((T[j1 ] )[j2 ] ) · · · )[jp ] .
Let us thus consider Ii for some fixed i ∈ N . The path from sT to i in T consists of the nodes r0 = sT , r1 , . . . , rt = i. Let J = ∪_{l=1}^{t} ChT (rl−1 ) \ {rl }. Then we have [9]

    Ii (x, y, T, U ) = Ii (xT[J] , yT[J] , T[J] , UT[J] ).    (4)

The influence index can thus equivalently be computed on the restricted tree T[J] .

3 Generic Complexity Reduction of Ii (x, y; U, T )

Our aim is to implement the influence index in practice. The influence index contains an exponential number of terms. It is thus very challenging to perform its exact computation. A complexity analysis is performed in Sect. 3.1. An alternative expression of the influence index, reducing its computational complexity, is proposed in Sect. 3.2.

3.1 Complexity Analysis

By Sect. 2.3, the expression of the influence index is given by (3). Hence the complexity of Ii (x, y; U, T ) depends on the number of permutations Π(T ). For j ∈ MT \ N , we denote by T|j the subtree of T starting at node j, defined by MT|j := DescT (j), NT|j := LeafT (j), sT|j := j, and ChT|j (l) = ChT (l) for all l ∈ MT|j \ NT|j . Then the cardinality of Π(T ) can be recursively computed thanks to the next result.

Lemma 1. |Π(T )| = |ChT (sT )|! × Π_{j ∈ ChT (sT )} |Π(T|j )|.
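Lemma 1 can be verified numerically on the tree of Fig. 1: the sketch below generates the admissible orderings Π(T ) recursively (the leaves of every subtree stay consecutive) and compares their number with the recursive product of the lemma.

```python
from itertools import permutations
from math import factorial

CHILDREN = {10: [1, 7, 9], 7: [2, 3], 9: [6, 8], 8: [4, 5]}  # tree of Fig. 1

def admissible(node):
    """All leaf orderings in Π(T|node): leaves of each subtree are consecutive."""
    if node not in CHILDREN:
        return [[node]]
    result = []
    for order in permutations(CHILDREN[node]):   # order the children blocks
        partial = [[]]
        for child in order:                      # concatenate child orderings
            partial = [p + q for p in partial for q in admissible(child)]
        result.extend(partial)
    return result

def count(node):
    """Right-hand side of Lemma 1, applied recursively."""
    if node not in CHILDREN:
        return 1
    c = factorial(len(CHILDREN[node]))
    for j in CHILDREN[node]:
        c *= count(j)
    return c

orders = admissible(10)
assert len(orders) == count(10) == 48            # 3! * 1 * 2! * (2! * 2!)
assert len(set(map(tuple, orders))) == 48        # all orderings are distinct
```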

The proofs of this result and of the others are omitted due to space limitations.
Lemma 1 provides a recursive formula to compute the number of compatible permutations in a tree T , that is, the complexity of Ii (x, y, T, U ).
By (4), the extended Owen value of node i for tree T can be computed equivalently on the tree T[J] . The implementation of these formulae then requires enumerating only the permutations Π(T[J] ), which drastically reduces the complexity.

Example 3. For T of Fig. 2 (left) and i = 1, we obtain J = {2, 10, 14}. Figure 2 (right) shows T[J] . □

[Fig. 2. Trees T (left) and T[J] (right), J = {2, 10, 14}. Left: the root 15 has children 13 and 14; node 13 has children 9 and 10; node 14 has children 11 and 12; nodes 9, 10, 11, 12 have children {1, 2}, {3, 4}, {5, 6}, {7, 8} respectively. Right: the root 15 has children 13 and 14; node 13 has children 9 and 10; node 9 has children 1 and 2.]

In order to demonstrate the gain obtained by using T[J] instead of T , let us take the example of uniform trees, denoted by T^Un_{d,p} (with d, p ∈ N∗ ), where each aggregation node has exactly p children and each leaf is exactly at depth d from the root. Figure 2 (left) illustrates T^Un_{3,2} . The next lemma gives the expression of the number of permutations associated to the uniform tree T^Un_{d,p} .

Lemma 2. n = |N_{T^Un_{d,p}}| = p^d , |Π(T^Un_{d,p})| = (p!)^{Σ_{k=0}^{d−1} p^k} , and |Π((T^Un_{d,p})[J] )| = (p!)^d .

Table 1 below shows a clear benefit of using T[J] instead of T in the computation of
the influence index: the ratio amounts to orders of magnitude when n increases.

3.2 Alternative Expression of Ii (x, y; U, T )

Expression (3) takes the form of an average over permutations. The number of terms in the sum in (3) is equal to C(N ) := 2^{|N |−1} . We give in this section an equivalent new expression taking profit of relation (4).
Consider Ii for some fixed i ∈ N . We set Vl := ChT (rl−1 ) for all l ∈ {1, . . . , t}, and V′l := Vl \ {rl } – see Fig. 3.

Expression (3) can be turned into a sum over coalitions, which somewhat reduces the computational complexity:

[Fig. 3. Illustration of the notation rl and Vl : along the path r0 = sT , r1 , . . . , rt = i, each set Vl = ChT (rl−1 ) gathers rl together with its siblings V′l = Vl \ {rl }.]

Theorem 1. We have

    Ii (x, y, U, T[J] ) = Σ_{S1 ⊆ V′1} · · · Σ_{St ⊆ V′t} [ Π_{l=1}^{t} |Sl |! × (|V′l | − |Sl | − 1)! / Π_{l=1}^{t} |V′l |! ] × ( U (yi , [yS1..t x]V′1..t ) − U (xi , [yS1..t x]V′1..t ) ),    (5)

where Sl..j = Sl ∪ · · · ∪ Sj , V′l..j = V′l ∪ · · · ∪ V′j , xk = vkU (x), yk = vkU (y), and [yS x]T (for S ⊆ T ) denotes an alternative taking the value of y in S and the value of x in T \ S.
The computational complexity of (5) is given by the next result.

Lemma 3. The number of terms in (5) is of order C(T ) := Π_{l=1}^{t} 2^{|V′l |} .
The last two columns in Table 1 present the log of the number of operations in the expression of the influence index written over coalitions rather than over permutations. The complexity of computing the influence index reducing (resp. not reducing) to the restricted tree is denoted by C(T^Un_{d,p}) (resp. C(N_{T^Un_{d,p}})).
Un )).

We obtain significant improvements on the computation time. In the second part of the paper, we will aim at drastically reducing this complexity – going from exponential complexity to polynomial or even linear – by taking an appropriate family of hierarchical aggregation models.

Table 1. Logarithm of the number of permutations and subsets for uniform trees T^Un_{d,p} (the |Π(·)| columns count permutations as in (3); the C(·) columns count coalitions as in (5)).

d  p  n     log10 |Π(N)|  log10 |Π(T)|  log10 |Π(T[J])|  log10 C(N_{T^Un_{d,p}})  log10 C(T^Un_{d,p})
3  3  27    28.036        10.115        2.334            8.128                    1.806
3  4  64    89.1          28.984        4.14             19.266                   2.709
3  5  125   209.27        64.454        6.237            37.629                   3.612
3  6  216   412.0         122.86        8.571            65.022                   4.515
4  3  81    120.76        31.126        3.112            24.383                   2.408
4  4  256   506.93        117.31        5.520            77.063                   3.612
4  5  625   1477.7        324.35        8.316            188.14                   4.816
4  6  1296  3473.0        740.04        11.429           390.13                   6.021
5  3  243   475.76        94.156        3.89             73.15                    3.010
5  4  1024  2639.7        470.65        6.901            308.2                    4.515
5  5  3125  9566.3        1623.84       10.395           940.7                    6.021
5  6  7776  26879         4443.15       14.286           2340.9                   7.526
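Several entries of Table 1 can be reproduced directly from the closed forms of Lemma 2; a minimal sketch (the constant 2.302585… is ln 10, converting natural to decimal logarithms):

```python
from math import lgamma, log10

# From Lemma 2: |Π(N)| = n!, |Π(T)| = (p!)^(1 + p + ... + p^(d-1)),
# |Π(T_[J])| = (p!)^d, and (from Lemma 3) log10 C(T_[J]) = (p - 1) d log10(2).

def log10_fact(n):
    """log10(n!) computed via the log-gamma function."""
    return lgamma(n + 1) / 2.302585092994046

# Row d = 3, p = 3 (n = 27) of Table 1:
assert abs(log10_fact(27) - 28.036) < 0.01           # log10 |Π(N)|
assert abs(13 * log10_fact(3) - 10.115) < 0.01       # exponent 1 + 3 + 9 = 13
assert abs(3 * log10_fact(3) - 2.334) < 0.001        # log10 |Π(T_[J])|
assert abs(2 * 3 * log10(2) - 1.806) < 0.001         # log10 C(T_[J])

# Row d = 4, p = 5:
assert abs(4 * log10_fact(5) - 8.316) < 0.001
assert abs(4 * 4 * log10(2) - 4.816) < 0.001
```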

4 Computation of the Influence Index with Imprecise Values

In many practical situations, the values of the alternatives are imprecise. We have justified this in the introduction, in particular for Example 2. For the sake of simplicity, the imprecision on the two alternatives on which the explanation is computed is given as intervals: x̂ = [x̲, x̄] and ŷ = [y̲, ȳ], with x̲, x̄, y̲, ȳ ∈ X. The problem is to define the influence index between x̂ and ŷ.
The idea is to propagate the imprecision on the values of x and y to the computation of the influence index. The influence of node i in the comparison between x̂ and ŷ is a closed interval defined by

    Ii (x̂, ŷ, T, U ) = [ I̲i (x̂, ŷ, T, U ), Īi (x̂, ŷ, T, U ) ],

where

    I̲i (x̂, ŷ, T, U ) = min_{x ∈ x̂} min_{y ∈ ŷ} Ii (x, y, T, U ),
    Īi (x̂, ŷ, T, U ) = max_{x ∈ x̂} max_{y ∈ ŷ} Ii (x, y, T, U ).

We have

    I̲i (x̂, ŷ, T, U ) = min_{x ∈ x̂} min_{y ∈ ŷ} Ii (x, y, T, U )
        = min_{x ∈ x̂} min_{y ∈ ŷ} (1/|Π(T )|) Σ_{π ∈ Π(T)} [ U (yi , ySπ (i)\{i} , x−Sπ (i) ) − U (xi , ySπ (i)\{i} , x−Sπ (i) ) ]
        = min_{x−i ∈ x̂−i} min_{y−i ∈ ŷ−i} (1/|Π(T )|) Σ_{π ∈ Π(T)} [ U (y̲i , ySπ (i)\{i} , x−Sπ (i) ) − U (x̄i , ySπ (i)\{i} , x−Sπ (i) ) ],

and

    Īi (x̂, ŷ, T, U ) = max_{x ∈ x̂} max_{y ∈ ŷ} Ii (x, y, T, U )
        = max_{x ∈ x̂} max_{y ∈ ŷ} (1/|Π(T )|) Σ_{π ∈ Π(T)} [ U (yi , ySπ (i)\{i} , x−Sπ (i) ) − U (xi , ySπ (i)\{i} , x−Sπ (i) ) ]
        = max_{x−i ∈ x̂−i} max_{y−i ∈ ŷ−i} (1/|Π(T )|) Σ_{π ∈ Π(T)} [ U (ȳi , ySπ (i)\{i} , x−Sπ (i) ) − U (x̲i , ySπ (i)\{i} , x−Sπ (i) ) ].

In the general case, computing I̲i (x̂, ŷ, T, U ) or Īi (x̂, ŷ, T, U ) is difficult. We will show in the next section that these computations become tractable for the multi-linear model.
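One reason to expect tractability is that a multi-linear function is affine in each variable separately, so its extrema over a box of intervals are attained at vertices of the box. The naive sketch below (with an arbitrary three-variable multi-linear function, not taken from the paper) illustrates this fact, though it is not the paper's actual computation scheme.

```python
from itertools import product

def f(a):
    # An arbitrary multi-linear function of three variables (illustration only).
    return 0.8 * a[0] + a[1] - 0.8 * a[0] * a[1] + 0.3 * a[2] * a[0]

box = [(0.2, 0.7), (0.0, 1.0), (0.4, 0.9)]        # one interval per variable

# Exact bound: enumerate the 2^3 vertices of the box.
vertex_min = min(f(v) for v in product(*box))

# Brute-force comparison over a fine grid of the same box.
grid = [[lo + k * (hi - lo) / 10 for k in range(11)] for lo, hi in box]
grid_min = min(f(v) for v in product(*grid))

assert abs(vertex_min - grid_min) < 1e-9   # the vertex bound is already exact
```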

5 Case of the Multi-linear Model


Section 3.2 has provided an improved expression of the influence index reducing its computational complexity. However, it is still exponential in the number of criteria and the depth of the tree. We cannot further reduce the computational complexity without making assumptions on the utility model U . For applications requiring real-time computation of the explanations and/or presenting a large tree of criteria, we need to restrict ourselves to classes of models U having specific properties allowing us to break the exponential complexity of the computation. This can be easily obtained by considering very simple aggregation models. For example, if all aggregation models in the tree are simple weighted sums

    vlU (x) = Hl ((vkU (x))k∈ChT (l) ) = Σ_{k ∈ ChT (l)} wl (k) vkU (x),    (6)

where wl (k) is the weight of node k at aggregation node l, then one can easily show that

    Ii (x, y; T ; U ) = (ui (yi ) − ui (xi )) Π_{l=0}^{t−1} wrl (rl+1 ).    (7)
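Formula (7) is straightforward to transcribe; the path and the weights below are hypothetical illustrations.

```python
def influence_ws(u_yi, u_xi, path_weights):
    """(7): I_i = (u_i(y_i) - u_i(x_i)) * product of w_{r_l}(r_{l+1})."""
    prod = 1.0
    for w in path_weights:
        prod *= w
    return (u_yi - u_xi) * prod

# Hypothetical path 10 -> 9 -> 8 -> 5 of Fig. 1, with weights 1/3, 1/2, 1/2:
assert abs(influence_ws(1.0, 0.0, [1/3, 1/2, 1/2]) - 1/12) < 1e-9
```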

Even though the complexity of computing Ii (x, y; T ; U ) is linear in the depth of the
tree, the underlying model is very simple and far from being able to capture real-life
preferences.
We are thus looking for a decision model realizing a good compromise between
a high representation power (in particular being able to capture interaction among
attributes) and a low computation time for the influence indices. We explore in this paper
the multi-linear model and believe that it realizes such a good compromise. Section 5.1
describes the multi-linear model. Section 5.2 shows that the expression of the influence
index for the multi-linear model can be drastically simplified in terms of computational
complexity. Section 5.3 shows that when the values of the alternatives are uncertain, the
computation of the influence is also tractable for the multi-linear model.

5.1 Multi-linear Model

Consider an aggregation node l ∈ MT \ N , whose children are ChT (l). For the sake of simplicity, we assume that the components that are aggregated by Hl are simply denoted by the vector a = (a1 , . . . , anl ), with nl = |ChT (l)|.
There exist many aggregation functions [2, 7]. The simplest one is the weighted sum (see (6)):

    WS(a) = Σ_{i=1}^{nl} wl (i) ai ,

where wl (i) is the weight assigned to node i. This model assumes independence among the criteria.
Without loss of generality, we can assume that the scores lie in the interval [0, 1], where 0 (resp. 1) means the criterion is not satisfied at all (resp. completely satisfied). In order to represent interaction among criteria, the idea is to assign weights not only to single criteria but also to subsets of criteria. A capacity (also called a fuzzy measure [15]) is a set function vl : 2^{{1,...,nl}} → [0, 1] such that vl (∅) = 0, vl ({1, . . . , nl }) = 1 and vl (S) ≤ vl (T ) whenever S ⊆ T [3]. The term vl (S) represents the aggregated score of an option being very well satisfied on the criteria in S (with score 1) and very ill satisfied on the other criteria (with score 0).
The Möbius transform of vl , denoted by ml : 2^{{1,...,nl}} → R, is given by [13]

    ml (A) = Σ_{B ⊆ A} (−1)^{|A\B|} vl (B).

A capacity is said to be 2-additive if its Möbius coefficients are zero for all subsets of three or more elements. Two classical aggregation functions can be obtained given the Möbius
coefficients ml . The first one is the Choquet integral [3]

    Cml (a) = Σ_{T ∈ Sl} ml (T ) × min_{m ∈ T} am ,

whereas the second one is the multi-linear model

    Mml (a) = Σ_{T ∈ Sl} ml (T ) × Π_{m ∈ T} am ,    (8)

where Sl is the set of subsets of {1, . . . , nl } on which the Möbius coefficients are non-null.
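Definition (8) can be transcribed directly from a table of non-null Möbius coefficients; a minimal sketch with illustrative 2-additive coefficients on two criteria (the values 0.8/1.0/−0.8 are an arbitrary but valid capacity, not taken from the paper):

```python
def multilinear(moebius, a):
    """M_ml(a) = sum over T of m_l(T) * product of a_m for m in T."""
    total = 0.0
    for T, m in moebius.items():
        prod = 1.0
        for i in T:
            prod *= a[i]
        total += m * prod
    return total

# Illustrative 2-additive Möbius coefficients on criteria {0, 1}:
m = {(0,): 0.8, (1,): 1.0, (0, 1): -0.8}
assert abs(multilinear(m, (1.0, 1.0)) - 1.0) < 1e-9   # v({0, 1}) = 1 (normalized)
assert abs(multilinear(m, (1.0, 0.0)) - 0.8) < 1e-9   # v({0}) = 0.8
```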
The next example illustrates the multi-linear model w.r.t. a two-additive capacity.

Example 4 (Example 2 cont.). After eliciting the tactical operator's preferences, the aggregation functions are given by:
Node 7: There is suspicion of illegal activity whenever either drug or human smuggling is detected. Hence there is redundancy between criteria 2 and 3. As human smuggling (crit. 3) is slightly more important than criterion 2, we obtain v7U (x) = 0.8 v2U (x) + v3U (x) − 0.8 v2U (x) × v3U (x);
Node 8: v8U (x) = (v4U (x) + v5U (x))/2;
Node 9: Nodes 6 and 8 are redundant, since there is a high risk that the ship escapes interception when it is either close to the shore (crit. 6) or very fast (node 8). Hence v9U (x) = 0.8 v6U (x) + 0.8 v8U (x) − 0.6 v6U (x) × v8U (x);
Node 10: Nodes 1 and 7 are redundant since there is a suspicion on the ship when the score is high on either node 1 or 7. Nodes 7 and 9 are complementary, as the risk is not so high for a suspicious ship (high value at node 7) that is easy to intercept (low value at node 9), or for a ship that is difficult to intercept but that is not suspicious. We have the same behavior between nodes 1 and 9. Hence v10U (x) = ( v1U (x) + v7U (x) − v1U (x) × v7U (x) + v1U (x) × v9U (x) + v7U (x) × v9U (x) ) / 3.
For x = (+, −, −, +, +, +), we obtain u2 (x2 ) = u3 (x3 ) = 0, ui (xi ) = 1 for i ∈ {1, 4, 5, 6}, v7U (x) = 0, v8U (x) = v9U (x) = 1 and U (x) = v10U (x) = 2/3. □
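The four aggregation formulas can be evaluated bottom-up in a few lines; the sketch below assumes v7U (x) = 0.8 v2U (x) + v3U (x) − 0.8 v2U (x) v3U (x) (the form under which H7 is a normalized capacity) and reproduces U (x) = 2/3 for the stated alternative.

```python
def U(v1, v2, v3, v4, v5, v6):
    """Bottom-up evaluation of the hierarchy of Example 4 (leaf utilities given)."""
    v7 = 0.8 * v2 + v3 - 0.8 * v2 * v3            # node 7: drug/human smuggling
    v8 = (v4 + v5) / 2                            # node 8: kinematics (average)
    v9 = 0.8 * v6 + 0.8 * v8 - 0.6 * v6 * v8      # node 9: capability to escape
    return (v1 + v7 - v1 * v7 + v1 * v9 + v7 * v9) / 3   # node 10: overall PL

# x = (+, -, -, +, +, +): u2 = u3 = 0 and u_i = 1 for i in {1, 4, 5, 6}.
assert abs(U(1, 0, 0, 1, 1, 1) - 2/3) < 1e-9
```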

5.2 Expression of the Influence Index for the Multi-linear Model

We consider the case where all aggregations functions are multi-linear models.
We now give the main result of this paper.

Theorem 2. Assume that the aggregation function at node $r_l$ (for $l \in \{0, \ldots, t-1\}$) is done with a multi-linear extension w.r.t. Möbius coefficients $m_{r_l}$. Then

$$I_i(x, y; U, T_{[J]}) = (u_i(y_i) - u_i(x_i)) \times \prod_{l=0}^{t-1} \Phi_l, \qquad (9)$$

where

$$\Phi_l = \sum_{T \subseteq V_{l+1},\; T \cup \{r_{l+1}\} \in S_{l+1}} m_{r_l}(T \cup \{r_{l+1}\}) \times \sum_{S' \subseteq T} \prod_{m \in T \cap S'} y_m \times \prod_{m \in T \setminus S'} x_m \times \left[ \sum_{s'=0}^{|V_{l+1}|-|T|-1} \frac{(|V_{l+1}|-|T|-1)!}{s'!\,(|V_{l+1}|-|T|-1-s')!} \cdot \frac{(|S'|+s')!\,(|V_{l+1}|-|S'|-s'-1)!}{|V_{l+1}|!} \right].$$

In the generic expression of the influence index (see (5)), the complexity of the
computation of Ii grows exponentially with the number t of layers (see Lemma 3).
Explaining Hierarchical Multi-linear Models 203

Thanks to the previous result, one readily sees that the computation of the influence only grows linearly with the depth of the tree for the multi-linear model. In (9), the influence index takes the form of a product of influences computed for each layer, where $\Phi_l$ is the local influence at aggregation node $r_l$. Hence the computation of (9) becomes very fast, whatever the depth of the tree and the number of aggregation functions, as the number of children of each aggregation node is small in practice (in general between 2 and 6). We note that there are strong similarities with the case of a weighted sum (see (7)). The weighted sum is a particular case of a multi-linear model in which all Möbius coefficients of subsets of two or more elements are zero. In this case, $\Phi_l$ reduces to $m_{r_l}(\{r_{l+1}\})$, which is equal to the weight $w_{r_l}(r_{l+1})$ of node $r_{l+1}$ at aggregation node $r_l$ in a weighted sum. Hence (9) reduces to (7) for a weighted sum.

Lemma 4. The number of terms in (5) is of order

$$C_{MultiLin}(T) := 1 + \sum_{l=1}^{t} \sum_{T \subseteq V_l,\; T \cup \{r_l\} \in S_l} |V_l| \times 2^{|T|} \;\le\; 1 + \sum_{l=1}^{t} |V_l|\, 3^{|V_l|}. \qquad (10)$$

If all multi-linear models are two-additive, the complexity becomes

$$C_{MultiLin}(T) = 1 + \sum_{l=1}^{t} |V_l|\, [1 + 2|V_l|].$$
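For instance, the two-additive count can be evaluated in one line. This is our own sketch; the layer sizes $|V_l|$ below are made-up values, not taken from the paper.

```python
def c_multilin_2add(layer_sizes):
    """Term count of Lemma 4 when every multi-linear model is two-additive:
    1 + sum over layers of |V_l| * (1 + 2 |V_l|)."""
    return 1 + sum(v * (1 + 2 * v) for v in layer_sizes)

# hypothetical tree with three layers of sizes |V_1| = 3, |V_2| = |V_3| = 2
print(c_multilin_2add([3, 2, 2]))  # 1 + 3*7 + 2*5 + 2*5 = 42
```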

We now illustrate Theorem 2 on the running example.

Example 5 (Example 4 cont.). We consider the two options $x = (+,-,-,-,-,-)$ and $y = (+,+,+,+,+,+)$. We have

$$u_1(x) = 1, \quad u_2(x) = u_3(x) = u_4(x) = u_5(x) = u_6(x) = 0,$$
$$u_1(y) = u_2(y) = u_3(y) = u_4(y) = u_5(y) = u_6(y) = 1.$$

Then the influence of, say, node 4 is equal to

$$I_4(x, y; U, T_{[J]}) = (u_4(y) - u_4(x)) \times \Phi_0 \times \Phi_1 \times \Phi_2,$$

where $\Phi_l$ is the contribution of aggregation node $r_l$ to the influence. We have

$$\Phi_0 = m_{10}(\{9\}) + m_{10}(\{1, 9\}) \frac{u_1(x) + u_1(y)}{2} + m_{10}(\{7, 9\}) \frac{u_7(x) + u_7(y)}{2} \quad (\text{as } m_{10}(\{1, 7, 9\}) = 0),$$
$$\Phi_1 = m_9(\{8\}) + m_9(\{6, 8\}) \frac{u_6(x) + u_6(y)}{2},$$
$$\Phi_2 = m_8(\{4\}) \quad (\text{as } m_8(\{4, 5\}) = 0).$$

Hence $\Phi_0 = \frac{1}{2}$, $\Phi_1 = \frac{1}{2}$, $\Phi_2 = \frac{1}{2}$ and $I_4(x, y; U, T_{[J]}) = 0.125$.
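A quick numeric check of Example 5 (our own sketch): each $\Phi_l$ is computed from Möbius coefficients that can be read off the aggregation functions of Example 4; the specific coefficient values below are our reading of those functions.

```python
# Utilities of the two options of Example 5.
ux = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}  # x = (+,-,-,-,-,-)
uy = {i: 1.0 for i in range(1, 7)}                       # y = (+,+,+,+,+,+)

# Utility at node 7 for both options (aggregation function of Example 4).
u7x = 0.8 * ux[2] + ux[3] - 0.8 * ux[2] * ux[3]
u7y = 0.8 * uy[2] + uy[3] - 0.8 * uy[2] * uy[3]

# Möbius coefficients read off Example 4:
# m10({9}) = 0, m10({1,9}) = m10({7,9}) = 1/3, m9({8}) = 0.8,
# m9({6,8}) = -0.6, m8({4}) = 1/2.
phi0 = 0 + (1 / 3) * (ux[1] + uy[1]) / 2 + (1 / 3) * (u7x + u7y) / 2
phi1 = 0.8 + (-0.6) * (ux[6] + uy[6]) / 2
phi2 = 0.5
I4 = (uy[4] - ux[4]) * phi0 * phi1 * phi2
print(phi0, phi1, phi2, I4)  # each Φ_l ≈ 0.5, I4 ≈ 0.125
```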

5.3 Computation of the Influence Index with Imprecise Values for the
Multi-linear Model
As in Sect. 5.2, we now assume that all aggregation functions are multi-linear models.
From Theorem 2,

$$I_i(x, y; U, T_{[J]}) = (y_i - x_i) \times \prod_{l=0}^{t-1} \Phi_l(x, y),$$

where

$$\Phi_l(x, y) = \sum_{S \subseteq V_l} \sum_{T \subseteq V_l} \frac{|S|! \times (|V_l| - |S| - 1)!}{|V_l|!} \times m_l(T \cup \{r_l\}) \times \prod_{j \in T \cap S} y_j \times \prod_{j \in T \setminus S} x_j.$$

Let us start with the computation of the lower bound of the influence of criterion $i$:

$$\underline{I_i}(\mathbf{x}, \mathbf{y}, T, U) = \min_{x_{-i} \in \mathbf{x}_{-i}}\; \min_{y_{-i} \in \mathbf{y}_{-i}} I_i\big((x_i, x_{-i}), (\underline{y}_i, y_{-i}); T, U\big) = (\underline{y}_i - x_i) \times \prod_{l=0}^{t-1} \underline{\Phi_l},$$

where $\underline{\Phi_l} = \min_{x_{-i} \in \mathbf{x}_{-i}} \min_{y_{-i} \in \mathbf{y}_{-i}} \Phi_l(x, y)$. Let $k \in V_l$. Let us analyse the monotonicity of $\Phi_l(x, y)$ with respect to the variables $x_k$ and $y_k$:
 
$$\Phi_l(x, y) = \sum_{S \subseteq V_l \setminus \{k\}} \sum_{T \subseteq V_l \setminus \{k\}} \Bigg[\, m_l(T \cup \{r_l\}) \frac{|S|! \times (|V_l| - |S| - 1)!}{|V_l|!} \qquad (11)$$
$$\qquad + m_l(T \cup \{r_l\}) \frac{(|S| + 1)! \times (|V_l| - |S| - 2)!}{|V_l|!}$$
$$\qquad + m_l(T \cup \{r_l, k\}) \frac{|S|! \times (|V_l| - |S| - 1)!}{|V_l|!}\, x_k$$
$$\qquad + m_l(T \cup \{r_l, k\}) \frac{(|S| + 1)! \times (|V_l| - |S| - 2)!}{|V_l|!}\, y_k \Bigg] \times \prod_{j \in T \cap S} y_j \times \prod_{j \in T \setminus S} x_j.$$

The first two terms in the bracket are constant w.r.t. $x_k$ and $y_k$. Hence $\Phi_l$ is linear in $x_k$ and in $y_k$. This implies that the minimum value of $\Phi_l(x, y)$ is attained at an extreme point of the intervals. As this holds for every $k$, we obtain

$$\underline{\Phi_l} = \min_{x_{-i} \in \prod_{j \neq i} \{\underline{x}_j, \overline{x}_j\}}\; \min_{y_{-i} \in \prod_{j \neq i} \{\underline{y}_j, \overline{y}_j\}} \Phi_l(x, y).$$

The optimal value can be obtained by enumerating the extreme values. This is not too time consuming, as the number of elements in $V_l$ is not large. A similar approach can be performed to compute $\overline{I_i}(\mathbf{x}, \mathbf{y}, T, U)$.
A more efficient approach can be derived to compute $\underline{I_i}(\mathbf{x}, \mathbf{y}, T, U)$ and $\overline{I_i}(\mathbf{x}, \mathbf{y}, T, U)$ under assumptions on $m_l$. By (11), if $m_l(T \cup \{r_l, k\}) \ge 0$ (resp. $\le 0$) for all $T \subseteq V_l \setminus \{k\}$, then $\Phi_l$ is monotonically increasing (resp. decreasing) w.r.t. $x_k$ and $y_k$. Hence the minimum $\underline{\Phi_l}$ is attained at $x_k = \underline{x}_k$ (resp. at $x_k = \overline{x}_k$). This is in particular the case when the Möbius coefficients are 2-additive. Indeed, for a 2-additive capacity, $m(T \cup \{r_l, k\})$ can be non-zero only for $T = \emptyset$.
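The enumeration argument can be sketched as follows: a function that is linear in each coordinate attains its minimum over a box at a vertex, so it suffices to evaluate the $2^{|V_l|}$ extreme points. This is our own illustration; the function and interval values are hypothetical toy data, not taken from the paper.

```python
from itertools import product

def min_over_box(f, intervals):
    """Minimum of a coordinate-wise linear function over a box: by linearity
    in each variable, the optimum lies at a vertex, so enumerating the 2^d
    vertices (Cartesian product of interval endpoints) is exact."""
    return min(f(v) for v in product(*intervals))

# toy multi-linear function of two variables with imprecise (interval) values
f = lambda v: 0.8 * v[0] + 0.8 * v[1] - 0.6 * v[0] * v[1]
lo = min_over_box(f, [(0.2, 0.7), (0.1, 0.9)])
print(lo)  # ≈ 0.228, attained at the vertex (0.2, 0.1)
```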

6 Conclusion and Perspectives

The problem of generating explanations is of particular importance in many applications. It is also very challenging. We have considered the problem of explaining a hierarchical multi-criteria decision aiding model using influence indices extending the Shapley value. The main drawback of this approach is that its computational complexity grows exponentially with the depth of the tree. We have shown that this complexity becomes linear when the aggregation functions are multi-linear models. Secondly, we have considered the case where the values of the alternatives on the criteria are imprecise. The influence indices thus become imprecise as well. An efficient computation approach is proposed for the multi-linear model.
This work can be extended in several directions. In applications where a multi-linear model is not suitable, it is crucial to obtain efficient algorithms for other classes of aggregation models, such as the Choquet integral. One could also check the validity of the explanations with real users.

References
1. Bacchus, F., Grove, A.: Graphical models for preference and utility. In: Conference on
Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, pp. 3–10, July 1995
2. Beliakov, G., Pradera, A., Calvo, T.: Aggregation Functions: A Guide for Practitioners. Stud-
ies in Fuzziness and Soft Computing, vol. 221. Springer, Heidelberg (2007). https://doi.org/
10.1007/978-3-540-73721-6
3. Choquet, G.: Theory of capacities. Annales de l’Institut Fourier 5, 131–295 (1953)
4. Datta, A., Sen, S., Zick, Y.: Algorithmic transparency via quantitative input influence. In:
IEEE Symposium on Security and Privacy, San Jose, CA, USA, May 2016
5. Diestel, R.: Graph Theory. Springer, New York (2005)
6. Fishburn, P.: Interdependence and additivity in multivariate, unidimensional expected utility
theory. Int. Econ. Rev. 8, 335–342 (1967)
7. Grabisch, M., Marichal, J., Mesiar, R., Pap, E.: Aggregation Functions. Cambridge Univer-
sity Press, Cambridge (2009)
8. Junker, U.: QUICKXPLAIN: preferred explanations and relaxations for over-constrained
problems. In: Proceedings of the 19th National Conference on Artificial Intelligence (AAAI
2004), San Jose, California, pp. 167–172, July 2004
9. Labreuche, C., Fossier, S.: Explaining multi-criteria decision aiding models with an extended
Shapley value. In: Proceedings of the Twenty-Seventh International Joint Conference on
Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, pp. 331–339, July 2018
10. Lundberg, S., Lee, S.: A unified approach to interpreting model predictions. In: Guyon, I.,
et al. (eds.) 31st Conference on Neural Information Processing Systems (NIPS 2017), Long
Beach, CA, USA, pp. 4768–4777 (2017)
11. Owen, G.: Multilinear extensions of games. Management Sci. 18, 64–79 (1972)

12. Ribeiro, M., Singh, S., Guestrin, C.: “Why Should I Trust You?”: explaining the predic-
tions of any classifier. In: KDD 2016 Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, pp.
1135–1144 (2016)
13. Rota, G.: On the foundations of combinatorial theory I. Theory of Möbius functions.
Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 2, 340–368 (1964)
14. Shapley, L.S.: A value for n-person games. In: Kuhn, H.W., Tucker, A.W. (eds.) Contribu-
tions to the Theory of Games, Vol. II. Annals of Mathematics Studies, no. 28, pp. 307–317.
Princeton University Press, Princeton (1953)
15. Sugeno, M.: Fuzzy measures and fuzzy integrals. Trans. S.I.C.E. 8(2), 218–226 (1972)
16. Wachter, S., Mittelstadt, B., Russell, C.: Counterfactual explanations without opening the
black box: automated decisions and the GDPR. Harvard J. Law Technol. 31(2), 841–887
(2018)
Assertional Removed Sets Merging of DL-Lite
Knowledge Bases

Salem Benferhat1 , Zied Bouraoui1 , Odile Papini2 , and Eric Würbel2(B)


1
CRIL-CNRS UMR 8188, Univ Artois, Arras, France
{benferhat,bouraoui}@cril.univ-artois.fr
2
LIS-CNRS UMR 7020, Aix Marseille Univ, Université de Toulon, Marseille, France
{papini,wurbel}@univ-amu.fr

Abstract. DL-Lite is a tractable family of Description Logics that underlies the


OWL-QL profile of the ontology web language, which is specifically tailored for
query answering. In this paper, we consider the setting where the queried data are
provided by several and potentially conflicting sources. We propose a merging
approach, called “Assertional Removed Sets Fusion” (ARSF) for merging DL-
Lite assertional bases. This approach stems from the inconsistency minimization
principle and consists in determining the minimal subsets of assertions, called
assertional removed sets, that need to be dropped from the original assertional
bases in order to resolve conflicts between them. We give several merging strate-
gies based on different definitions of minimality criteria, and we characterize the
behaviour of these strategies with respect to rational properties. The last part of
the paper shows how to use the notion of hitting sets for computing the assertional
removed sets, and the merging outcome.

1 Introduction
In recent years, there has been an increasing use of ontologies in many application
areas including query answering, Semantic Web and information retrieval. Description
Logics (DLs) have been recognized as powerful formalisms for both representing and
reasoning about ontologies. A DL knowledge base is built upon two distinct compo-
nents: a terminological base (called TBox), representing generic knowledge about an
application domain, and an assertional base (called ABox), containing assertional facts
that instantiate terminological knowledge. Among Description Logics, a lot of attention
was given to DL-Lite [12], a lightweight family of DLs specifically tailored for appli-
cations that use huge volumes of data for which query answering is the most important
reasoning task. DL-Lite guarantees a low computational complexity of the reasoning
process.
In many practical situations, data are provided by several and potentially conflicting
sources, where getting meaningful answers to queries is challenging. While the avail-
able sources are individually consistent, gathering them together may lead to inconsis-
tency. Dealing with inconsistency in query answering has received a lot of attention
in recent years. For example, a general framework for inconsistency-tolerant semantics


c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 207–220, 2019.
https://doi.org/10.1007/978-3-030-35514-2_16

was proposed in [4, 5]. This framework considers two key notions: modifiers and infer-
ence strategies. Inconsistency tolerant query answering is seen as made out of a modi-
fier, which transforms the original ABox into a set of repairs, i.e. subsets of the original
ABox which are consistent w.r.t. the TBox, and an inference strategy, which evaluates
queries from these repairs. Interestingly enough, such a setting covers the main existing
works on inconsistency-tolerant query answering (see e.g. [2, 9, 22]). Pulling together
the data provided by available sources and then applying inconsistency-tolerant query
answering semantics provides a solution to deal with inconsistency. However, in this
case valuable information about the sources will be lost. This information is indeed
important when trying to find better strategies to deal with inconsistency during the merging process.
This paper addresses query answering by merging data sources. Merging consists
in achieving a synthesis between pieces of information provided by different sources.
The aim of merging is to provide a consistent set of information, making maximum use
of the information provided by the sources while not favoring any of them. Merging
is an important issue in many fields of Artificial Intelligence [10]. Within the classical logic setting, belief merging has been studied from different standpoints. One can distinguish model-based approaches, which perform a selection among the interpretations that are the closest to the original belief bases. Postulates characterizing the rational behaviour of such merging operators, known as IC postulates, have been proposed by Revesz [25] and improved by Konieczny and Pérez [21] in the same spirit as the seminal AGM [1] postulates for revision. Several concrete merging operators have
been proposed [11, 20, 21, 23, 26]. In contrast to model-based approaches, the formula-
based approaches perform selection on the set of formulas that are explicitly encoded
in the initial belief bases. Some of these approaches have been adapted in the con-
text of DL-Lite [13]. Falappa et al. [14] proposed a set of postulates to characterize
the behaviour of belief bases merging operators and concrete merging operators have
been proposed [6, 8, 14, 17, 19, 24]. Among these formula-based merging approaches,
the Removed Sets Fusion approach has been proposed in [17, 18] for merging propositional belief bases. This approach consists in removing a minimal subset of formulae, called a removed set, in order to restore consistency. The minimality criterion in Removed Sets Fusion depends on the operator used to perform merging, which can be the sum (Σ), the cardinality (Card), the maximum (Max), or the lexicographic ordering (GMax). This approach
has shown interesting properties: it is not too cautious and satisfies most rational IC
postulates when extended to belief sets revision.
This paper studies DL-Lite Assertional Removed Sets Fusion (ARSF). The main
motivation in considering ARSF is to take advantage of the tractability of DL-Lite for
the merging process and the rational properties satisfied by ARSF operators. We con-
sider in particular DL-LiteR as a member of the DL-Lite family, which offers a good
compromise between expressive power and computational complexity and underlies
the OWL2-QL profile. We propose several merging strategies based on different definitions of the minimality criterion, and we give a characterization of these merging strategies.
The last section contains algorithms based on the notion of hitting sets for computing the
merging outcome.

2 Background
In this paper, we only consider DL-LiteR , denoted by L, which underlies OWL2-QL.
However, results of this work can be easily generalized for several members of the
DL-Lite family (see [3] for more details about the DL-Lite family).
Syntax. A DL-Lite knowledge base K = T , A is built upon a set of atomic con-
cepts (i.e. unary predicates), a set of atomic roles (i.e. binary predicates) and a set of
individuals (i.e. constants). Complex concepts and roles are formed as follows:

B −→ A|∃R, C −→ B|¬B, R −→ P |P − , E −→ R|¬R,

where A (resp. P) is an atomic concept (resp. role). B (resp. C) are called basic (resp.
complex) concepts and roles R (resp. E) are called basic (resp. complex) roles. The
TBox T consists of a finite set of inclusion axioms between concepts of the form B ⊑ C and inclusion axioms between roles of the form R ⊑ E. The ABox A consists of a
finite set of membership assertions on atomic concepts and on atomic roles of the form:
A(ai ), P (ai , aj ), where ai and aj are individuals. For the sake of simplicity, in the rest
of this paper, when there is no ambiguity we simply use DL-Lite instead of DL-LiteR .
Semantics. The DL-Lite semantics is given by an interpretation I = (ΔI , .I ) which
consists of a nonempty domain ΔI and an interpretation function .I . The function
.I assigns to each individual a an element aI ∈ ΔI , to each concept C a subset
C I ⊆ ΔI and to each role R a binary relation RI ⊆ ΔI × ΔI over ΔI . More-
over, the interpretation function .I is extended for all constructs of DL-LiteR . For
instance: (¬B)I = ΔI \B I , (∃R)I = {x ∈ ΔI |∃y ∈ ΔI such that (x, y) ∈ RI }
and (P − )I = {(y, x) ∈ ΔI × ΔI |(x, y) ∈ P I }. Concerning the TBox, we say that
I satisfies a concept (resp. role) inclusion axiom, denoted by I |= B  C (resp.
I |= R  E), iff B I ⊆ C I (resp. RI ⊆ E I ). Concerning the ABox, we say that
I satisfies a concept (resp. role) membership assertion, denoted by I |= A(ai ) (resp.
I |= P (ai , aj )), iff aIi ∈ AI (resp. (aIi , aIj ) ∈ P I ). Finally, an interpretation I is said
to satisfy K = T , A iff I satisfies every axiom in T and every assertion in A. Such
interpretation is said to be a model of K.
Incoherence and Inconsistency. Two kinds of inconsistency can be distinguished in
DL setting: incoherence and inconsistency [7]. A knowledge base is said to be inconsistent iff it does not admit any model, and it is said to be incoherent iff there exists at least one non-satisfiable concept C, namely such that for each interpretation I which is a model of T , we have C^I = ∅. In the DL-Lite setting, a TBox T = {PIs, NIs} can be viewed as com-
posed of positive inclusion axioms, denoted by (PIs), and negative inclusion axioms,
denoted by (NIs). PIs are of the form B1 ⊑ B2 or R1 ⊑ R2 and NIs are of the form B1 ⊑ ¬B2 or R1 ⊑ ¬R2 . The negative closure of T , denoted by cln(T ), represents the
propagation of the NIs using both PIs and NIs in the TBox (see [12] for more details).
Important properties have been established in [12] for consistency checking in DL-Lite:
K is consistent if and only if cln(T ), A is consistent. Moreover, every DL-Lite knowl-
edge base with only PIs in its TBox is always satisfiable. However when T contains NI
axioms then the DL-Lite knowledge base may be inconsistent and in an assertional-
based approach only elements of ABoxes are removed to restore consistency [13].

3 Assertional Removed Sets Fusion


In this section, we study removed sets fusion to merge a set {A1 , · · · , An } of n asser-
tional bases, representing different sources of information, linked to a DL-lite ontology
T . As representation formalism, we consider MK = T , MA , an MBox knowledge
base where MA = {A1 , . . . , An } is called an MBox. An MBox is simply a multi-set
of membership assertions, where each Ai is an assertional base linked to T . We assume
that MK is coherent, i.e. T is coherent and for each Ai , 1 ≤ i ≤ n, T , Ai  is consis-
tent. However, the MBox MK may be inconsistent since the assertional bases Ai may
be conflicting w.r.t. T . We define the notion of conflict as a minimal inconsistent subset
of A1 ∪ . . . ∪ An , more formally:
Definition 1. Let MK = T , MA  be an inconsistent MBox DL-Lite knowledge base.
A conflict C is a set of membership assertions such that (i) C ⊆ A1 ∪ · · · ∪ An , (ii)
T , C is inconsistent, (iii) ∀C  , if C  ⊂ C then T , C   is consistent.
We denote by C(MK ) the collection of conflicts in MK . Since MK is assumed to be
finite, if MK is inconsistent then C(MK ) = ∅ is also finite.
Within the DL-Lite framework, in order to restore consistency, the following defini-
tion introduces the notion of potential assertional removed set.
Definition 2. Let MK = T , MA  be a MBox DL-Lite knowledge base. A potential
assertional removed set, denoted by X, is a set of membership assertions such that (i)
X ⊆ A1 ∪ · · · ∪ An , (ii) T , (A1 ∪ · · · ∪ An )\X is consistent, (iii) ∀X  , if X  ⊂ X ⊆
A1 ∪ · · · ∪ An then T , (A1 ∪ · · · ∪ An )\X   is inconsistent.
We denote by PR(MK ) the set of potential assertional removed sets of MK . If MK is
consistent then PR(MK ) = {∅}. The concept of potential assertional removed sets is
to some extent dual to the concept of repairs (maximally consistent subbase). Namely,
if X is a potential assertional removed set then (A1 ∪ · · · ∪ An )\X is a repair, and
conversely.
Example 1. Let MK = ⟨T , MA ⟩ be an inconsistent MBox DL-Lite knowledge base such that T = {A ⊑ ¬B, C ⊑ ¬D} and MA = {A1 , A2 , A3 } where A1 = {A(a), C(a)}, A2 = {A(a), A(b)} and A3 = {B(a), D(a), C(b)}. By Definition 1, C(MK ) = {{A(a), B(a)}, {C(a), D(a)}}. Hence, by Definition 2, PR(MK ) = {{A(a), C(a)}, {A(a), D(a)}, {B(a), C(a)}, {B(a), D(a)}}.
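The conflicts and potential assertional removed sets of Example 1 can be computed by brute force; the sketch below is ours (names are illustrative). Since DL-Lite conflicts involve at most two assertions, the universe is small and naive enumeration of minimal hitting sets suffices here.

```python
from itertools import chain, combinations

# Binary conflicts of Example 1, induced by A ⊑ ¬B and C ⊑ ¬D.
conflicts = [frozenset({"A(a)", "B(a)"}), frozenset({"C(a)", "D(a)"})]

def potential_removed_sets(conflicts):
    """Minimal hitting sets (w.r.t. set inclusion) of the conflicts,
    by brute-force enumeration of all candidate subsets."""
    universe = sorted(set(chain.from_iterable(conflicts)))
    hits = [frozenset(s)
            for r in range(len(universe) + 1)
            for s in combinations(universe, r)
            if all(c & frozenset(s) for c in conflicts)]
    # keep only the subset-minimal hitting sets
    return {h for h in hits if not any(g < h for g in hits)}

for X in sorted(potential_removed_sets(conflicts), key=sorted):
    print(sorted(X))  # the four potential assertional removed sets of Example 1
```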
In order to cope with conflicting sources, merging aims at exploiting the comple-
mentarity between the sources providing the ABoxes, so merging strategies are neces-
sary. These merging strategies are captured by total pre-orders on potential assertional
removed sets. Let X and Y be two potential assertional removed sets, for each strategy
P a total pre-order ≤P over the potential assertional removed sets is defined. X ≤P Y
means that X is preferred to Y according to the strategy P . We define <P as the strict
total pre-order associated to ≤P (i.e. X <P Y if and only if X ≤P Y and not Y ≤P X).
Definition 3. Let MK = T , MA  be a MBox DL-Lite knowledge base. An assertional
removed set according to the strategy P , denoted by X, is a set of membership asser-
tions such that (i) X is a potential assertional removed set of MK ; (ii) there does not
exist any Y such that Y is a potential assertional removed set of MK and Y <P X.

We denote by RP (MK ) the set of assertional removed sets according to the strat-
egy P of MK . If MK is consistent then RP (MK ) = {∅}. The usual merging strate-
gies sum-based (Σ), cardinality-based (Card), maximum-based (M ax) and lexico-
graphic ordering (GM ax) are captured by the following total pre-orders. We denote
by s(MA ) the ABox obtained from MK where every assertion expressed more than
once is reduced to a singleton.
 
(Σ): X ≤_Σ Y if Σ_{1≤i≤n} |X ∩ A_i| ≤ Σ_{1≤i≤n} |Y ∩ A_i|.
(Card): X ≤_Card Y if |X ∩ s(M_A)| ≤ |Y ∩ s(M_A)|.
(Max): X ≤_Max Y if max_{1≤i≤n} |X ∩ A_i| ≤ max_{1≤i≤n} |Y ∩ A_i|.
(GMax): For every potential assertional removed set X and every ABox A_i, we define p_X^{A_i} = |X ∩ A_i|. Let L_X^{M_A} be the sequence (p_X^{A_1}, . . . , p_X^{A_n}) sorted in decreasing order. Let X and Y be two potential assertional removed sets of M_K; X ≤_GMax Y if L_X^{M_A} ≤_lex L_Y^{M_A}.¹
The Σ strategy minimizes the total number of assertions removed from MA. The Card strategy, similarly to Σ, attempts to minimize the number of removed assertions, but it does not take into account assertions which are expressed several times. Note that the Σ and Card strategies only differ if there are redundant assertions. The Max strategy tries to distribute the assertions to be removed among the ABoxes as evenly as possible: it removes as few assertions as possible from the most affected ABox. The GMax strategy is a lexicographic refinement of the Max strategy. Note that when there is only one source, all strategies become equivalent.
We now present assertional-based DL-LiteR merging operators. A merging opera-
tor is a function that maps an MBox DL-LiteR MK = T , MA  to a knowledge base
Δ(MK ) = T , Δ(MA ), where the function Δ defined from L × . . . × L to L, merges
according to a strategy a multiset of assertions MA into a set of assertions denoted by
Δ(MA ). In the DL-Lite language, it is not possible to find a set of assertions which
represents the disjunction of such possible merged sets of assertions. If we want to keep
the result of merging in DL-Lite, several options are possible. The first one is to consider the intersection of all possible merged sets of assertions; however, this option may be too cautious, since it could remove too many assertions and contradicts in some sense the minimal change principle. Another option is to define a selection function, which allows us to define the family of ARSF operators. In this paper we consider the family
of selection functions that select exactly one assertional removed set as follows.
Definition 4. A selection function f is a mapping from RP (MK ) to A1 ∪ . . . ∪ An
such that (i) f (RP (MK )) = X with X ∈ RP (MK ), (ii) f ({∅}) = ∅.
Definition 5. Let MK = ⟨T , MA ⟩ be a MBox DL-Lite knowledge base, f be a selection function, and P be a strategy. The merged DL-Lite knowledge base, denoted by Δ^arsf_P(MK ), is such that Δ^arsf_P(MK ) = ⟨T , Δ^arsf_P(MA )⟩ where Δ^arsf_P(MA ) = (A1 ∪ . . . ∪ An ) \ f (RP (MK )).
Let MK = ⟨T , MA ⟩ be a MBox DL-Lite knowledge base, and q(x) a query. Querying multiple data sources is performed by querying the merged data sources, and ⟨T , MA ⟩ |= q(x) amounts to ⟨T , Δ^arsf_P(MA )⟩ |= q(x).

¹ (X1 , · · · , Xn ) ≤_lex (Y1 , · · · , Yn ) if ∃i, 1 ≤ i ≤ n, (i) Xi ≤ Yi , (ii) ∀j, 1 ≤ j < i, Xj = Yj .

Example 2. Let MK = ⟨T , MA ⟩ be the MBox of Example 1. The potential assertional removed sets are X1 = {A(a), C(a)}, X2 = {A(a), D(a)}, X3 = {B(a), C(a)} and X4 = {B(a), D(a)}. As illustrated in the table below², we have RΣ (MK ) = {X3 , X4 }. Suppose the selection function f is such that f (RΣ (MK )) = X4 ; we have Δ^arsf_Σ(MA ) = {A(a), C(a), A(b), C(b)}. We have RCard (MK ) = {X1 , X2 , X3 , X4 }. Suppose f (RCard (MK )) = X1 ; we have Δ^arsf_Card(MA ) = {A(b), B(a), D(a), C(b)}. We have RMax (MK ) = {X2 , X3 }. Suppose f (RMax (MK )) = X2 ; we have Δ^arsf_Max(MA ) = {C(a), A(b), B(a), C(b)}. We have RGMax (MK ) = {X3 } and Δ^arsf_GMax(MA ) = {A(a), D(a), A(b), C(b)}.

Xi |Xi ∩ A1 | |Xi ∩ A2 | |Xi ∩ A3 | Σ Card M ax GM ax


X1 2 1 0 3 2 2 210
X2 1 1 1 3 2 1 111
X3 1 0 1 2 2 1 110
X4 0 0 2 2 2 2 200
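The four pre-orders can be implemented directly as sort keys over the profiles (|X ∩ A_1|, . . . , |X ∩ A_n|). The sketch below is ours (names are illustrative) and reproduces the selections of Example 2.

```python
# ABoxes of Example 1 and the four potential assertional removed sets.
A = {1: {"A(a)", "C(a)"}, 2: {"A(a)", "A(b)"}, 3: {"B(a)", "D(a)", "C(b)"}}
PR = [frozenset(x) for x in ({"A(a)", "C(a)"}, {"A(a)", "D(a)"},
                             {"B(a)", "C(a)"}, {"B(a)", "D(a)"})]

def profile(X):
    """(|X ∩ A_1|, ..., |X ∩ A_n|)"""
    return [len(X & Ai) for Ai in A.values()]

keys = {
    "Sigma": lambda X: sum(profile(X)),
    "Card":  lambda X: len(X & set().union(*A.values())),  # duplicates collapsed
    "Max":   lambda X: max(profile(X)),
    "GMax":  lambda X: sorted(profile(X), reverse=True),   # lists compare lexicographically
}
for name, key in keys.items():
    best = min(map(key, PR))
    # Σ selects {X3, X4}, Card all four, Max {X2, X3}, GMax {X3} (cf. Example 2)
    print(name, sorted(sorted(X) for X in PR if key(X) == best))
```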

4 Logical Properties
Within the context of propositional logic, postulates have been proposed in order to classify reasonable belief base merging operators [14–16]³. In order to give logical properties of ARSF operators, we first rephrase these postulates within the DL-Lite framework, and then analyse to which extent the proposed operators satisfy these postulates for any selection function.
Let MK = T , MA  and MK = T , MA  be two MBox DL-Lite knowledge
bases, let Δ be an assertional-based merging operator and T , Δ(MA ) be the DL-Lite
knowledge base resulting from merging, where Δ(MA ) is a set of assertions. Let σ be
a permutation over {1, . . . n}, and MA = {A1 , . . . , An } be a multiset of assertions,
σ(MA ) denotes the set {Aσ(1) , . . . , Aσ(n) }. We rephrase the postulates as follows:

Inclusion Δ(MA ) ⊆ A1 ∪ . . . ∪ An .
Symmetry For any permutation σ over {1, . . . n}, Δ(σ(MA )) = Δ(MA ).
Consistency T , Δ(MA ) is consistent.
Congruence If A1 ∪ . . . ∪ An = A1 ∪ . . . ∪ An then Δ(MA ) = Δ(MA ).
Vacuity If T , MA  is consistent then Δ(MA ) = A1 ∪ . . . ∪ An .
Reversion If T , MA  and T , MA  have the same minimal inconsistent sub-
sets then (A1 ∪ . . . ∪ An )\Δ(MA ) = (A1 ∪ . . . ∪ An )\Δ(MA ).

2
On each column the assertional removed sets are in bold.
3
We do not consider the IC postulates [21] since they apply to belief sets and not to belief bases.

Core-retainment If α ∈ A1 ∪ . . . ∪ An and α ∉ Δ(MA ) then there exists A′ s.t. A′ ⊆ A1 ∪ . . . ∪ An , A′ is consistent but A′ ∪ {α} is inconsistent.
Relevance If α ∈ A1 ∪ . . . ∪ An and α ∉ Δ(MA ) then there exists A′ s.t. Δ(MA ) ⊆ A′ ⊆ A1 ∪ . . . ∪ An , A′ is consistent but A′ ∪ {α} is inconsistent.

Inclusion states that the union of the initial ABoxes is the upper bound of any merging
operation. Symmetry establishes that all ABoxes are considered of equal importance.
Consistency requires the consistency of the result of merging. Congruence requires that
the result of merging should not depend on syntactic properties of the ABoxes. Vacuity
says that if the union of the ABoxes is consistent w.r.t. T then the result of merging
equals this union. Reversion says that if ABoxes have the same minimal inconsistent
subsets w.r.t. T then the assertions erased in the respective ABoxes are the same. Core-retainment and Relevance express the intuition that nothing is removed from the original ABoxes unless its removal in some way contributes to making the result consistent.
Proposition 1. Let MK = ⟨T , MA ⟩ be a MBox DL-Lite knowledge base. For any selection function, ∀P ∈ {Σ, Card, Max, GMax}, Δ^arsf_P satisfies Inclusion, Symmetry, Consistency, Vacuity, Core-retainment and Relevance. Δ^arsf_Card satisfies Congruence and Reversion, but ∀P ∈ {Σ, Max, GMax}, Δ^arsf_P satisfies neither Congruence nor Reversion.
(Sketch of the proof.) For any selection function, by Definitions 4 and 5, ∀P ∈ {Σ, Card, Max, GMax}, Δ^arsf_P satisfies Inclusion, Symmetry, Consistency, Vacuity and Core-retainment.
Relevance: By Definition 5, for any selection function f , ∀P ∈ {Σ, Card, Max, GMax}, if α ∈ A1 ∪ . . . ∪ An and α ∉ Δ^arsf_P(MA ) then α ∈ f (RP (MK )). Let A′ = Δ^arsf_P(MA ); A′ is consistent and A′ ∪ {α} is inconsistent since α ∈ f (RP (MK )) and f (RP (MK )) is an assertional removed set. By Definition 5, Δ^arsf_Card satisfies Congruence and Reversion since every assertion expressed more than once is reduced to a singleton.
We provide a counter-example for Δ^arsf_P, ∀P ∈ {Σ, Max, GMax}. Let MK = ⟨T , MA ⟩ be an inconsistent MBox DL-Lite knowledge base such that T = {A ⊑ ¬B} and A1 = {A(a)}, A2 = {A(b), B(a)}, A3 = {B(a), B(b)}. The potential assertional removed sets are PR(MK ) = {X1 , X2 , X3 , X4 } with X1 = {A(a), A(b)}, X2 = {A(a), B(b)}, X3 = {B(a), A(b)}, X4 = {B(a), B(b)}, and the sets of assertional removed sets are RΣ (MK ) = {X1 , X2 }, RMax (MK ) = {X1 , X2 } and RGMax (MK ) = {X1 , X2 }.

Xi |Xi ∩ A1 | |Xi ∩ A2 | |Xi ∩ A3 | Σ M ax GM ax


X1 1 1 0 2 1 110
X2 1 0 1 2 1 110
X3 0 2 1 3 2 210
X4 0 1 2 3 2 210

Besides, let M′K = ⟨T , M′A ⟩ be an inconsistent MBox DL-Lite knowledge base such that T = {A ⊑ ¬B} and A′1 = {A(a), B(b)}, A′2 = {B(a)}, A′3 = {A(a), A(b)}. We have A′1 ∪ A′2 ∪ A′3 = A1 ∪ A2 ∪ A3 and PR(M′K ) = PR(MK ), and the sets of assertional removed sets are RΣ (M′K ) = {X3 , X4 }, RMax (M′K ) = {X3 , X4 } and RGMax (M′K ) = {X3 , X4 }.

Xi |Xi ∩ A 1 | |Xi ∩ A 2 | |Xi ∩ A 3 | Σ M ax GM ax


X1 1 0 2 3 2 210
X2 2 0 1 3 2 210
X3 0 1 1 2 1 110
X4 1 1 0 2 1 110

∀P ∈ {Σ, Max, GMax} we have RP (MK ) ∩ RP (M′K ) = ∅, so there is no selection function f such that f (RP (MK )) ∈ RP (M′K ); therefore Δ^arsf_P(MA ) ≠ Δ^arsf_P(M′A ).

5 Computing ARSF Merging Outcome


We first show the one-to-one correspondence between potential assertional removed sets and minimal hitting sets w.r.t. set inclusion [28]. We recall that a set H is a hitting set of a collection of sets C iff ∀C ∈ C, C ∩ H ≠ ∅.

Proposition 2. Let X be such that X ⊆ ∪1≤i≤n Ai . X is a potential assertional removed set of MK if and only if X is a minimal hitting set w.r.t. set inclusion of C(MK ).

The proof is straightforward following Definition 2. Notice that the computation of the set of conflicts C(MK ) can be done in polynomial time w.r.t. the size of MK ; see e.g. [7]. In the following, we provide a single algorithm to compute the potential assertional removed sets and the assertional removed sets according to the strategies Card, Σ, Max and GMax. We give explanations on the different use cases of this algorithm hereafter. For a given assertional base MK , the outcome of Algorithm 1 depends on the value of the parameter P : if P ∈ {Card, Σ, Max, GMax}, then the result is RP (MK ); otherwise the result is PR(MK ).
Let us first focus on the computation of PR(MK ). The algorithm is an adaptation
of the algorithm for the computation of the minimal hitting sets w.r.t. set inclusion of
a collection of sets described in [28]. It relies on the breadth-first construction of a
directed acyclic graph called an HS-dag. An HS-dag T is a dag with labeled nodes and
edges such that: (i) The root is labeled with ∅ if C(MK ) is empty, otherwise it is labeled
with an arbitrary element of C(MK ); (ii) for each node n of T , we denote by H(n) the
set of edge labels on the path from n to the root of T ; (iii) The label of a node n is any
set C ∈ C(MK ) such that C ∩ H(n) = ∅ if such a set exists. Otherwise n is labeled
with ∅. Nodes labeled with ∅ are called terminal nodes; (iv) If n is labeled by a set C,
then for each α ∈ C, n has a successor nα , joined to n by an edge labeled by α.
Assertional Removed Sets Merging of DL-Lite Knowledge Bases 215

Algorithm 1. Computes the elements of RP(MK) or the elements of PR(MK), depending on the value of the parameter P.

1: function COMPUTE-ASSERTIONAL-RS(MK, P)                      ▷ P: strategy
2:   MK = ⟨T, MA⟩, MA = {A1, · · · , An}
3:   level ← 0
4:   label(root) ← an element C ∈ C(MK)                        ▷ root is the root node
5:   PrevQ ← {root}                                            ▷ queue of nodes in the previous level
6:   if P ∈ {Σ, Max, GMax} then
7:     MinNodes ← ∅                                            ▷ set of optimal nodes
8:     MinCost ← ∞                                             ▷ ∞ for Σ and Max, (∞, . . . , ∞) (n times) for GMax
9:   mincard ← false                                           ▷ used by the Card strategy
10:  while PrevQ ≠ ∅ and not mincard do
11:    level ← level + 1
12:    CurQ ← ∅
13:    for all no ∈ PrevQ do
14:      if label(no) ≠ ∅ and label(no) ≠ × then
15:        label(no) = {α, β}
16:        label(left_branch(no)) ← α
17:        label(right_branch(no)) ← β
18:        left_child(no) ← PROCESSCHILD(α, no, CurQ, MK, MinCost, MinNodes, P)
19:        right_child(no) ← PROCESSCHILD(β, no, CurQ, MK, MinCost, MinNodes, P)
20:        if (label(left_child(no)) = ∅ or label(right_child(no)) = ∅) and P = Card then
21:          mincard ← true
22:    PrevQ ← CurQ
23:  if P ∉ {Σ, Max, GMax} then
24:    MinNodes ← all nodes labelled with ∅
25:  return MinNodes

Algorithm 2. Processes a child branch of a node. Returns a node (new or recycled).

1: function PROCESSCHILD(b_label, pa, CurQ, MK, MinCost, MinNodes, P)
     ▷ b_label: label of the branch to the new node
     ▷ pa: the parent node
     ▷ CurQ: queue of nodes already processed at the current level (input/output parameter)
     ▷ MinCost: current minimum cost (input/output parameter)
     ▷ MinNodes: set of current minimum-cost nodes (input/output parameter)
     ▷ P: strategy
2:   MK = ⟨T, MA⟩
3:   MA = {A1, · · · , An}
4:   if ∃n′ ∈ CurQ such that H(n′) = H(pa) ∪ {b_label} then
5:     child_node ← n′                                         ▷ no new node creation
6:   else if ∃n′ ∈ T such that H(n′) ⊂ H(pa) ∪ {b_label} and label(n′) = ∅ then
7:     child_node ← a new node
8:     label(child_node) ← ×                                   ▷ this is a closed node
9:   else if P ∈ {Σ, Max, GMax} and COST(P, H(pa) ∪ {b_label}) > MinCost then
10:    child_node ← a new node
11:    label(child_node) ← ×                                   ▷ this is a closed node
12:  else
13:    child_node ← a new node
14:    label(child_node) ← an element C ∈ C(MK) such that C ∩ (H(pa) ∪ {b_label}) = ∅
15:    CurQ ← CurQ ∪ {child_node}
16:    if P ∈ {Σ, Max, GMax} and label(child_node) = ∅ then
17:      if COST(P, H(pa) ∪ {b_label}) < MinCost then
             ▷ close the current-level nodes which are no longer optimal
18:        for all nopt ∈ MinNodes do
19:          label(nopt) ← ×
20:        MinNodes ← ∅
21:        MinCost ← COST(P, H(pa) ∪ {b_label})
22:      MinNodes ← MinNodes ∪ {child_node}
23:  return child_node

In our case, the elements C ∈ C(MK) are such that |C| = 2 (see [12]), so the
HS-dag is binary. Algorithm 1 computes the potential assertional removed sets by
computing the minimal hitting sets w.r.t. set inclusion of C(MK). It builds a pruned
HS-dag in a breadth-first order, using some pruning rules to avoid a complete
development of the branches. We moved the processing of the left and right child nodes
into a separate function (described in Algorithm 2): first, this keeps the algorithm short
and simple; second, it facilitates the extension of the algorithm to the computation of
the assertional removed sets according to the different strategies.
216 S. Benferhat et al.

PrevQ and CurQ are sets containing respectively the nodes of the previous and the
current level. label(n) denotes the label of a node n; similarly, if b is a branch,
label(b) denotes the label of b. left_branch(n) (resp. right_branch(n)) denotes
the left (resp. right) branch under the node n, and left_child(n) (resp. right_child(n))
denotes the left (resp. right) child node of n. The algorithm iterates over the nodes
of a level and tries to develop the branches under each of these nodes. The central
property is that the conflict C labeling a node n is such that C ∩ H(n) = ∅.
Pruning rules are applied when trying to develop the left and right branches of some
parent node pa (lines 4–22 in function PROCESSCHILD, Algorithm 2). Let us briefly
describe them: (i) if there exists a node n′ on the same level as the currently developed
child branch such that H(n′) = H(pa) ∪ {b_label} (b_label being the label of the
currently developed child branch), we connect the child branch to n′, and no new
node is created (line 4); (ii) if there exists a node n′ in the HS-dag such that H(n′) ⊂
H(pa) ∪ {b_label} and n′ is a terminal node, then the node connected to the child branch
is a closed node (which is marked with ×) (line 6); (iii) otherwise the node connected
to the child branch is labelled by a conflict C such that (H(pa) ∪ {b_label}) ∩ C = ∅.
This new node is added to the current level queue.
Now we explain the computation of the assertional removed sets according to each strategy P.

Card strategy. The Card strategy is the simplest one to implement. First, observe
that the level of a node n in the HS-dag is equal to the cardinality of H(n). This
means that if n is an end node (a node labeled with ∅), the cardinality of the
corresponding minimal hitting set is |H(n)|. Thus, there is no need to continue the
construction of the HS-dag, as we are only interested in hitting sets which are minimal
w.r.t. cardinality. In the light of this observation, the only modification of the
algorithm is the use of a boolean flag mincard, which halts the computation at the end
of the level where the first potential assertional removed set has been detected.

Σ, Max and GMax strategies. As regards these strategies, we have no guarantee that
the assertional removed sets reside in the same level of the tree, as illustrated by the
following example for the Σ strategy.
Example 3. Let MK = ⟨T, MA⟩ be an inconsistent MBox DL-Lite knowledge base
such that T = {A ⊑ ¬B, C ⊑ ¬B}, and A1 = {A(a)}, A2 = {C(a)}, A3 = {B(a)},
A4 = {B(a)}, A5 = {B(a)}. We have PR(MK) = {{A(a), C(a)}, {B(a)}} and
RΣ(MK) = {{A(a), C(a)}}. Thus the only assertional removed set is found at level
2, while the first potential assertional removed set is found at level 1.
Similar examples can be exhibited for the Max and GMax strategies. The search
strategy and associated pruning techniques for Σ, Max and GMax are located in lines 9
and 16 of Algorithm 2. They rely on a cost function which takes as parameters a strategy
and a set S of ABox assertions. The different cost functions are defined according to
the strategies, that is, given an MBox MA = {A1, . . . , An}: for the Σ strategy,
COST(Σ, S) computes |S ∩ A1| + . . . + |S ∩ An|; for the Max strategy, COST(Max, S)
computes max(|S ∩ A1|, . . . , |S ∩ An|); for the GMax strategy, using p_S^Ai = |S ∩ Ai|,
COST(GMax, S) computes L_S^MA, which is the sequence (p_S^A1, . . . , p_S^An) sorted in
decreasing order, such sequences being compared lexicographically.
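The three cost functions can be sketched as follows (illustrative Python; the MBox below is an assumption on our part, chosen to be consistent with the costs displayed in Fig. 1 for Example 2):

```python
def cost(strategy, S, mbox):
    """COST(P, S) for an MBox mbox = [A1, ..., An] and a set S of assertions,
    following the definitions above."""
    per_source = [len(S & Ai) for Ai in mbox]     # (p_S^{A1}, ..., p_S^{An})
    if strategy == "Sigma":
        return sum(per_source)
    if strategy == "Max":
        return max(per_source)
    if strategy == "GMax":
        # sequence sorted in decreasing order; Python tuples compare
        # lexicographically, which matches the GMax comparison
        return tuple(sorted(per_source, reverse=True))
    raise ValueError(strategy)

# Assumed MBox of Example 2 (our reconstruction from the costs in Fig. 1)
A1, A2, A3 = {"A(a)", "C(a)"}, {"A(a)"}, {"B(a)", "D(a)"}
mbox = [A1, A2, A3]
S = {"A(a)", "C(a)"}
print(cost("Sigma", S, mbox), cost("Max", S, mbox), cost("GMax", S, mbox))
# → 3 2 (2, 1, 0)
```

Note that a smaller GMax tuple, e.g. (1, 1, 0), compares lexicographically below (2, 1, 0), which is exactly the "cost < MinCost" test used in Algorithm 2.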
The variable MinCost maintains the current minimal cost. In line 9 of Algorithm 2,
if the cost of the current node is greater than MinCost, then the node is closed, as it
cannot be optimal. Otherwise we create a new node, labelled with a conflict which does
not intersect H(pa) ∪ {b_label}. If no such label can be found (line 16), i.e. the current
node is a terminal node, then, at this point: (i) we are assured that COST(P, H(pa) ∪
{b_label}) ≤ MinCost, so we add the new node to the set of currently optimal nodes
(line 22); (ii) if the cost of the current node is strictly less than MinCost, then we close
all nodes currently believed to be optimal, empty the set containing them, and update
MinCost (lines 18–21).
Example 4. We illustrate the operation of the algorithm with the computation of the
assertional removed sets of Example 2. Figure 1 depicts the HS-dag built by Algorithm 1.
Circled numbers show the order in which nodes are created (apart from the root, which
is obviously the first node).

                     {A(a), B(a)}
                A(a) /           \ B(a)
         ① {C(a), D(a)}       ② {C(a), D(a)}
         C(a) /    \ D(a)     C(a) /    \ D(a)
         ③ ∅       ④ ∅        ⑤ ∅       ⑥ ∅
        Σ = 3      Σ = 3      Σ = 2      Σ = 2
        Max = 2    Max = 1    Max = 1    Max = 2
        GMax =     GMax =     GMax =     GMax =
        (2, 1, 0)  (1, 1, 1)  (1, 1, 0)  (2, 0, 0)

Fig. 1. Computing the removed sets of Example 2.

In order to facilitate the description, we denote by MinNodesP the variable
MinNodes when considering strategy P; the same applies for MinCost. At the end
of the processing of each node (PROCESSCHILD function), the state of these
variables is given.
root: The root is labelled with a conflict.

level 1
– The left and right branches of the root node are labelled respectively with A(a) and
B(a), the members of the root label (lines 16–17 of Algorithm 1).
– PROCESSCHILD(α, no, CurQ, MK, MinCost, MinNodes, P) is called. None of
the pruning conditions in lines 4, 6 and 9 apply, so node ① is created and
labelled with a conflict not intersecting H(①) = {A(a)}, namely {C(a), D(a)}.
The same processing leads to the creation of node ②.
State: MinNodes = ∅, MinCost = ∞ for any strategy.

level 2
– The left and right branches of node ① are labelled respectively with C(a) and D(a),
the members of its label (lines 16–17 of Algorithm 1).
– PROCESSCHILD(α, no, CurQ, MK, MinCost, MinNodes, P) (left branch
of node ①) is called. None of the pruning conditions in lines 4, 6 and 9 apply,
so node ③ is created. As there is no conflict C such that C ∩ H(③) = ∅,
the new node is labelled with ∅. Whatever the strategy is, its cost is necessarily
less than MinCost, which has been initialized to ∞. Thus MinCost is updated
to the cost of node ③ depending on the strategy, and node ③ is added to the
MinNodes set.
State: MinNodes = {③}, MinCostΣ = 3, MinCostMax = 2,
MinCostGMax = (2, 1, 0).
– PROCESSCHILD(β, no, CurQ, MK, MinCost, MinNodes, P) (right branch
of node ①) is called. None of the pruning conditions in lines 4, 6 and 9 apply,
so node ④ is created. As there is no conflict C such that C ∩ H(④) = ∅,
the new node is labelled with ∅. For strategy Σ, the cost of node ④ is equal to
MinCost, thus node ④ is added to the MinNodes set. For strategies Max and
GMax, the cost of node ④ is less than MinCost: node ③ is closed (line 18),
set MinNodes is emptied, and MinCost is updated.
State: MinNodesΣ = {③, ④}, MinNodesMax = {④}, MinNodesGMax =
{④}, MinCostΣ = 3, MinCostMax = 1, MinCostGMax = (1, 1, 1).
– The left and right branches of node ② are labelled respectively with C(a) and D(a),
the members of its label (lines 16–17 of Algorithm 1).
– PROCESSCHILD(α, no, CurQ, MK, MinCost, MinNodes, P) (left branch
of node ②) is called. None of the pruning conditions in lines 4, 6 and 9 apply,
so node ⑤ is created. As there is no conflict C such that C ∩ H(⑤) = ∅, the
new node is labelled with ∅. For strategy Σ, the cost of node ⑤ (2) is less than
MinCost. The same applies for GMax.
State: MinNodesΣ = {⑤}, MinNodesMax = {④, ⑤}, MinNodesGMax =
{⑤}, MinCostΣ = 2, MinCostMax = 1, MinCostGMax = (1, 1, 0).
– PROCESSCHILD(β, no, CurQ, MK, MinCost, MinNodes, P) (right branch
of node ②) is called. None of the pruning conditions apply, so node ⑥ is created.
As there is no conflict C such that C ∩ H(⑥) = ∅, the new node is labelled
with ∅. For strategy Σ, the cost of node ⑥ (2) is equal to MinCost.
State: MinNodesΣ = {⑤, ⑥}, MinNodesMax = {④, ⑤}, MinNodesGMax =
{⑤}, MinCostΣ = 2, MinCostMax = 1, MinCostGMax = (1, 1, 0).
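The trace above can be cross-checked by brute force: enumerate the inclusion-minimal hitting sets of the conflicts, then keep those of minimal cost under the chosen strategy. The sketch below is illustrative Python, not the HS-dag algorithm, and the MBox is our assumption, reconstructed to match the costs shown in Fig. 1:

```python
from itertools import combinations

def removed_sets(conflict_list, mbox, strategy):
    """Brute-force R_P(MK): among the inclusion-minimal hitting sets of the
    conflicts, keep those with minimal cost under the given strategy."""
    universe = sorted(set().union(*conflict_list))
    minimal = []
    for r in range(len(universe) + 1):
        for sub in combinations(universe, r):
            s = set(sub)
            if all(c & s for c in conflict_list) and not any(h <= s for h in minimal):
                minimal.append(s)

    def cost(S):
        per = sorted((len(S & Ai) for Ai in mbox), reverse=True)
        return {"Sigma": sum(per), "Max": max(per), "GMax": tuple(per)}[strategy]

    best = min(cost(s) for s in minimal)
    return [s for s in minimal if cost(s) == best]

mbox = [{"A(a)", "C(a)"}, {"A(a)"}, {"B(a)", "D(a)"}]   # assumed MBox of Example 2
conflict_list = [{"A(a)", "B(a)"}, {"C(a)", "D(a)"}]
print(removed_sets(conflict_list, mbox, "GMax"))
# the single GMax removed set, matching the GMax-optimal node of Fig. 1
```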

6 Conclusion

In this paper, we proposed a new family of assertional-based merging operators, called
Assertional Removed Sets Fusion (ARSF) operators, following several merging
strategies (Σ, Card, Max, GMax). We studied the behaviour of ARSF operators with
respect to a set of logical postulates (initially stated for propositional formula-based
merging), which we rephrased within the DL-Lite framework. From a computational
point of view, we proposed algorithms, stemming from the notion of hitting set, for
computing the potential assertional removed sets as well as the assertional removed
sets according to the different strategies.
Belief change has been investigated within the framework of DL-Lite. Calvanese et
al. [13] adapted formula-based and model-based approaches to ABox and TBox belief
revision and update; however, they did not consider belief merging. Wang et al. [27]
addressed the problem of TBox DL-Lite KB merging by adapting classical model-based
belief merging to DL-Lite. This approach differs from the one we propose, since we
extend formula-based merging to DL-Lite.
In future work, we plan to conduct a complexity analysis of the proposed algorithm
for the different merging strategies. Moreover, we also want to focus on
the implementation of ARSF operators and on an experimental study on real-world
applications, in particular 3D surveys within the context of underwater archaeology
and handling conflicts in dance videos. Furthermore, since the ARSF operators stem from
a selection function that selects one assertional removed set, we also plan to investigate
operators stemming from other selection functions, as well as other strategies and other
approaches than ARSF for performing assertional-based merging.

Acknowledgements. This work is partially supported by the European project H2020-MSCA-


RISE: AniAge (High Dimensional Heterogeneous Data based Animation Techniques for South-
east Asian Intangible Cultural Heritage). Zied Bouraoui was supported by CNRS PEPS INS2I
MODERN.

References
1. Alchourrón, C., Gärdenfors, P., Makinson, D.: On the logic of theory change: partial meet
contraction and revision functions. J. Symb. Log. 50(2), 510–530 (1985)
2. Arenas, M., Bertossi, L.E., Chomicki, J.: Consistent query answers in inconsistent databases.
In: Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Prin-
ciples of Database Systems, Philadelphia, Pennsylvania, USA, pp. 68–79 (1999)
3. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family and rela-
tions. J. Artif. Intell. Res. (JAIR) 36, 1–69 (2009)
4. Baget, J.F., et al.: A general modifier-based framework for inconsistency-tolerant query
answering. In: Principles of Knowledge Representation and Reasoning: Proceedings of the
Fifteenth International Conference, KR 2016, Cape Town, South Africa, 25–29 April 2016,
pp. 513–516 (2016)
5. Baget, J.F., et al.: Inconsistency-tolerant query answering: rationality properties and compu-
tational complexity analysis. In: Michael, L., Kakas, A. (eds.) JELIA 2016. LNCS (LNAI),
vol. 10021, pp. 64–80. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48758-
8_5
6. Baral, C., Kraus, S., Minker, J., Subrahmanian, V.S.: Combining knowledge bases consisting
of first order theories. Comp. Intell. 8(1), 45–71 (1992)
7. Benferhat, S., Bouraoui, Z., Papini, O., Würbel, E.: Assertional-based removed sets revision
of DL-LiteR knowledge bases. In: ISAIM (2014)
8. Benferhat, S., Dubois, D., Kaci, S., Prade, H.: Possibilistic merging and distance-based
fusion of propositional information. Stud. Logica. 58(1), 17–45 (1997)
9. Bienvenu, M.: On the complexity of consistent query answering in the presence of simple
ontologies. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence
(2012)
10. Bloch, I., Hunter, A., et al.: Fusion: general concepts and characteristics. Int. J. Intell. Syst.
16(10), 1107–1134 (2001)
11. Bloch, I., Lang, J.: Towards mathematical morpho-logics. In: Bouchon-Meunier, B.,
Gutiérrez-Ríos, J., Magdalena, L., Yager, R.R. (eds.) Technologies for Constructing Intel-
ligent Systems 2. STUDFUZZ, vol. 90, pp. 367–380. Physica, Heidelberg (2002). https://
doi.org/10.1007/978-3-7908-1796-6_29

12. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning
and efficient query answering in description logics: the DL-Lite family. J. Autom. Reasoning
39(3), 385–429 (2007)
13. Calvanese, D., Kharlamov, E., Nutt, W., Zheleznyakov, D.: Evolution of DL - Lite knowledge
bases. In: Patel-Schneider, P.F., et al. (eds.) ISWC 2010. LNCS, vol. 6496, pp. 112–128.
Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17746-0_8
14. Falappa, M.A., Kern-Isberner, G., Reis, M.D.L., Simari, G.R.: Prioritized and non-prioritized
multiple change on belief bases. J. Philos. Log. 41, 77–113 (2012)
15. Falappa, M.A., Kern-Isberner, G., Simari, G.R.: Explanations, belief revision and defeasible
reasoning. Artif. Intell. 141(1/2), 1–28 (2002)
16. Fuhrmann, A.: An Essay on Contraction. CSLI Publications, Stanford (1997)
17. Hue, J., Papini, O., Würbel, E.: Syntactic propositional belief bases fusion with removed
sets. In: Mellouli, K. (ed.) ECSQARU 2007. LNCS (LNAI), vol. 4724, pp. 66–77. Springer,
Heidelberg (2007). https://doi.org/10.1007/978-3-540-75256-1_9
18. Hué, J., Würbel, E., Papini, O.: Removed sets fusion: performing off the shelf. In: Proceed-
ings of ECAI 2008 (FIAI 178), pp. 94–98 (2008)
19. Konieczny, S.: On the difference between merging knowledge bases and combining them.
In: Proceedings of KR 2000, pp. 135–144 (2000)
20. Konieczny, S., Lang, J., Marquis, P.: DA2 merging operators. Artif. Intell. 157, 49–79 (2004)
21. Konieczny, S., Pérez, R.P.: Merging information under constraints. J. Log. Comput. 12(5),
773–808 (2002)
22. Lembo, D., Lenzerini, M., Rosati, R., Ruzzi, M., Savo, D.F.: Inconsistency-tolerant query
answering in ontology-based data access. J. Web Sem. 33, 3–29 (2015)
23. Lin, J., Mendelzon, A.: Knowledge base merging by majority. In: Pareschi, R., Fronhoefer,
B. (eds.) Dynamic Worlds: From the Frame Problem to Knowledge Management. Kluwer,
Dordrecht (1999)
24. Meyer, T., Ghose, A., Chopra, S.: Syntactic representations of semantic merging operations.
In: Ishizuka, M., Sattar, A. (eds.) PRICAI 2002. LNCS (LNAI), vol. 2417, p. 620. Springer,
Heidelberg (2002). https://doi.org/10.1007/3-540-45683-X_88
25. Revesz, P.Z.: On the semantics of theory change: arbitration between old and new informa-
tion. In: 12th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems,
pp. 71–92 (1993)
26. Revesz, P.Z.: On the semantics of arbitration. J. Algebra Comput. 7, 133–160 (1997)
27. Wang, Z., Wang, K., Jin, Y., Qi, G.: OntoMerge: a system for merging DL-Lite ontologies.
In: CEUR Workshop Proceedings, vol. 969, pp. 16–27 (2014)
28. Wilkerson, R.W., Greiner, R., Smith, B.A.: A correction to the algorithm in Reiter’s theory
of diagnosis. Artif. Intell. 41, 79–88 (1989)
An Interactive Polyhedral Approach
for Multi-objective Combinatorial
Optimization with Incomplete Preference
Information

Nawal Benabbou and Thibaut Lust(B)

Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, LIP6,


75005 Paris, France
{nawal.benabbou,thibaut.lust}@lip6.fr

Abstract. In this paper, we develop a general interactive polyhedral


approach to solve multi-objective combinatorial optimization problems
with incomplete preference information. Assuming that preferences can
be represented by a parameterized scalarizing function, we iteratively ask
preferences queries to the decision maker in order to reduce the impre-
cision over the preference parameters until being able to determine her
preferred solution. To produce informative preference queries at each
step, we generate promising solutions using the extreme points of the
polyhedron representing the admissible preference parameters and then
we ask the decision maker to compare two of these solutions (we pro-
pose different selection strategies). These extreme points are also used
to provide a stopping criterion guaranteeing that the returned solution is
optimal (or near-optimal) according to the decision maker’s preferences.
We provide numerical results for the multi-objective spanning tree and
traveling salesman problems with preferences represented by a weighted
sum to demonstrate the practical efficiency of our approach. We com-
pare our results to a recent approach based on minimax regret, where
preference queries are generated during the construction of an optimal
solution. We show that better results are achieved by our method both
in terms of running time and number of questions.

Keywords: Multi-objective combinatorial optimization · Minimum


spanning tree problem · Traveling salesman problem · Incremental
preference elicitation · Minimax regret

1 Introduction
The increasing complexity of applications encountered in Computer Science sig-
nificantly complicates the task of decision makers who need to find the best
solution among a very large number of options. Multi-objective optimization
is concerned with optimization problems involving several (conflicting) objec-
tives/criteria to be optimized simultaneously (e.g., minimizing costs while max-
imizing profits). Without preference information, we only know that the best
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 221–235, 2019.
https://doi.org/10.1007/978-3-030-35514-2_17

solution for the decision maker (DM) is among the Pareto-optimal solutions (a
solution is called Pareto-optimal if there exists no other solution that is better
on all objectives while being strictly better on at least one of them). The main
problem with this kind of approach is that the number of Pareto-optimal solu-
tions can be intractable, that is exponential in the size of the problem (e.g. [13]
for the multicriteria spanning tree problem). One way to address this issue is to
restrict the size of the Pareto set in order to obtain a “well-represented” Pareto
set; this approach is often based on a division of the objective space into different
regions (e.g., [15]) or on ε-dominance (e.g., [18]). However, whenever the DM
needs to identify the best solution, it seems more appropriate to refine the Pareto
dominance relation with preferences to determine a single solution satisfying the
subjective preferences of the DM. Of course, this implies the participation of the
DM who has to give us some insights and share her preferences.
In this work, we assume that the DM’s preferences can be represented by a
parameterized scalarizing function (e.g., a weighted sum), allowing some trade-
off between the objectives, but the corresponding preference parameters (e.g.,
the weights) are initially not known; hence, we have to consider the set of all
parameters compatible with the collected preference information. An interesting
approach to deal with preference imprecision has been recently developed [19,
21,30] and consists in determining the possibly optimal solutions, that is the
solutions that are optimal for at least one instance of the preference parameters.
The main drawback of this approach, though, is that the number of possibly
optimal solutions may still be very large compared to the number of Pareto-
optimal solutions; therefore there is a need for elicitation methods aiming to
specify the preference model by asking preference queries to the DM.
In this paper, we study the potential of incremental preference elicitation
(e.g., [23,27]) in the framework of multi-objective combinatorial optimization.
Preference elicitation on combinatorial domains is an active topic that has been
recently studied in various contexts, e.g. in multi-agents systems [1,3,6], in stable
matching problems [9], in constraint satisfaction problems [7], in Markov Deci-
sion Processes [11,24,28] and in multi-objective optimization problems [4,14,16].
Our aim here is to propose a general interactive approach for multi-objective
optimization with imprecise preference parameters. Our approach identifies
informative preference queries by exploiting the extreme points of the polyhe-
dron representing the admissible preference parameters. Moreover, these extreme
points are also used to provide a stopping criterion which guarantees the deter-
mination of the (near-)optimal solution. Our approach is general in the sense
that it can be applied to any multi-objective optimization problem, providing
that the scalarizing function is linear in its preference parameters (e.g., weighted
sums, Choquet integrals [8,12]) and that there exists an efficient algorithm to
solve the problem when preferences are precisely known (e.g., [17,22] for the
minimum spanning tree problem with a weighted sum).
The paper is organized as follows: We first give general notations and recall
the basic principles of regret-based incremental elicitation. We then propose
a new interactive method based on the minimax regret decision criterion and
An Interactive Polyhedral Approach for MOCO Problems 223

extreme points generation. Finally, to show the efficiency of our method, we


provide numerical results for two well-known problems, namely the multicriteria
traveling salesman and multicriteria spanning tree problems; for the latter, we
compare our results with those obtained by the state-of-the-art method.

2 Multi-objective Combinatorial Optimization

In this paper, we consider a general multi-objective combinatorial optimization


(MOCO) problem with n objective functions yi , i ∈ {1, . . . , n}, to be minimized.
This problem can be defined as follows:
 
minimize y1 (x), . . . , yn (x)
x∈X

In this definition, X is the feasible set in the decision space, typically defined by
some constraint functions (e.g., for the multicriteria spanning tree problem, X is
the set of all spanning trees of the graph). In this problem, any solution x ∈ X is
associated with a cost vector y(x) = (y1 (x), . . . , yn (x)) ∈ Rn where yi (x) is the
evaluation of x on the i-th criterion/objective. Thus the image of the feasible
set in the objective space is defined by {y(x) : x ∈ X } ⊂ Rn .
Solutions are usually compared through their images in the objective space
(also called points) using the Pareto dominance relation: we say that point u =
(u1 , . . . , un ) ∈ Rn Pareto dominates point v = (v1 , . . . , vn ) ∈ Rn (denoted by
u ≺P v) if and only if ui ≤ vi for all i ∈ {1, . . . , n}, with at least one strict
inequality. Solution x∗ ∈ X is called efficient if there does not exist any other
feasible solution x ∈ X such that y(x) ≺P y(x∗ ); its image in objective space is
then called a non-dominated point.
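The dominance test and the filtering of non-dominated points can be sketched as follows (illustrative Python, for minimized objectives; the sample points are toy data of ours):

```python
def pareto_dominates(u, v):
    """u ≺_P v: u is at least as good as v on every (minimized) objective
    and strictly better on at least one."""
    return (all(ui <= vi for ui, vi in zip(u, v))
            and any(ui < vi for ui, vi in zip(u, v)))

def non_dominated(points):
    """Keep the points that no other point Pareto-dominates."""
    return [p for p in points
            if not any(pareto_dominates(q, p) for q in points if q != p)]

pts = [(3, 5), (4, 4), (5, 5), (2, 7)]
print(non_dominated(pts))   # (5, 5) is dominated, e.g. by (3, 5)
```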

3 Minimax Regret Criterion

We assume here that the DM's preferences over solutions can be represented by a
parameterized scalarizing function fω that is linear in its parameters ω. Solution
x ∈ X is preferred to solution x′ ∈ X if and only if fω(y(x)) ≤ fω(y(x′)).
To give a few examples, function fω can be a weighted sum (i.e. fω(y(x)) =
Σ_{i=1}^n ωi yi(x)) or a Choquet integral with capacity ω [8,12]. We also assume that
parameters ω are not known initially. Instead, we consider a (possibly empty)
set Θ of pairs (u, v) ∈ Rn × Rn such that u is known to be preferred to v; this set
can be obtained by asking preference queries to the DM. Let ΩΘ be the set of all
parameters ω that are compatible with Θ, i.e. all parameters ω that satisfy the
constraints fω(u) ≤ fω(v) for all (u, v) ∈ Θ. Thus, since fω is linear in ω, we can
assume that ΩΘ is a convex polyhedron throughout the paper. The problem is
now to determine the most promising solution under the preference imprecision
(defined by ΩΘ). To do so, we use the minimax regret approach (e.g., [7]), which
is based on the following definitions:

Definition 1 (Pairwise Max Regret). The Pairwise Max Regret (PMR) of
solution x ∈ X with respect to solution x′ ∈ X is:

PMR(x, x′, ΩΘ) = max_{ω ∈ ΩΘ} {fω(y(x)) − fω(y(x′))}

In other words, PMR(x, x′, ΩΘ) is the worst-case loss when choosing solution x
instead of solution x′.

Definition 2 (Max Regret). The Max Regret (MR) of solution x ∈ X is:

MR(x, X, ΩΘ) = max_{x′ ∈ X} PMR(x, x′, ΩΘ)

Thus MR(x, X, ΩΘ) is the worst-case loss when selecting solution x instead of
any other feasible solution x′ ∈ X. We can now define the minimax regret:

Definition 3 (Minimax Regret). The MiniMax Regret (MMR) is:

MMR(X, ΩΘ) = min_{x ∈ X} MR(x, X, ΩΘ)

According to the minimax regret criterion, an optimal solution is a solution that
achieves the minimax regret (i.e., any solution in arg min_{x ∈ X} MR(x, X, ΩΘ)),
allowing to minimize the worst-case loss. Note that if MMR(X, ΩΘ) = 0, then
any optimal solution for the minimax regret criterion is necessarily optimal
according to the DM's preferences.
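Since fω is linear in ω, its maximum over the convex polyhedron ΩΘ is attained at one of the extreme points, so once these vertices are known, PMR, MR and MMR reduce to maxima and minima over finite sets. A sketch for the weighted-sum case (the polyhedron and points below are toy data of ours):

```python
def f(omega, y):
    """Weighted sum f_ω(y) = Σ_i ω_i y_i (one possible scalarizing function)."""
    return sum(w * yi for w, yi in zip(omega, y))

def pmr(y_x, y_xp, extreme_points):
    # f_ω is linear in ω, so the max over Ω_Θ is reached at an extreme point
    return max(f(w, y_x) - f(w, y_xp) for w in extreme_points)

def mr(y_x, all_points, extreme_points):
    return max(pmr(y_x, y_xp, extreme_points) for y_xp in all_points)

def mmr(all_points, extreme_points):
    """Return (minimax regret value, a cost vector achieving it)."""
    return min((mr(y, all_points, extreme_points), y) for y in all_points)

# Ω_Θ = {(ω1, ω2) : ω1 + ω2 = 1, 0.3 ≤ ω1 ≤ 0.7} has two extreme points
ext = [(0.3, 0.7), (0.7, 0.3)]
pts = [(3.0, 5.0), (4.0, 4.0), (2.0, 7.0)]
value, best = mmr(pts, ext)
print(best, value)
```

Here (3.0, 5.0) achieves the minimax regret (0.4 up to rounding): choosing it risks losing at most 0.4 against any other point for any admissible weight vector.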

4 An Interactive Polyhedral Method


Our aim is to produce an efficient regret-based interactive method for the deter-
mination of a (near-)optimal solution according to the DM’s preferences. Note
that the value M M R(X , ΩΘ ) can only decrease when inserting new preference
information in Θ, as observed in previous works (see e.g., [5]). Therefore, the
general idea of regret-based incremental elicitation is to ask preference queries
to the DM in an iterative way, until the value M M R(X , ΩΘ ) drops below a
given threshold δ ≥ 0 representing the maximum allowable gap to optimality;
one can simply set δ = 0 to obtain the preferred solution (i.e., the optimal
solution according to the DM’s preferences).
At each iteration step, the minimax regret MMR(X, ΩΘ) could be obtained
by computing the pairwise max regrets PMR(x, x′, ΩΘ) for all pairs (x, x′) of
distinct solutions in X (see Definitions 2 and 3). However, this would not be very
efficient in practice due to the large size of X (recall that X is the feasible set of
a MOCO problem). This observation has led a group of researchers to propose a
new approach consisting in combining preference elicitation and search by asking
preference queries during the construction of the (near-)optimal solution (e.g.,
[2]). In this work, we propose to combine incremental elicitation and search in
a different way: at each iteration step, we generate a set of promising solutions
using the extreme points of ΩΘ (the set of admissible parameters), we ask the

DM to compare two of these solutions, we update ΩΘ according to her answer


and we stop the process whenever a (near-)optimal solution is detected (i.e. a
solution x ∈ X such that M R(x, X , ΩΘ ) ≤ δ holds). More precisely, taking as
input a MOCO problem P , a tolerance threshold δ ≥ 0, a scalarizing function
fω with unknown parameters ω and an initial set of preference statements Θ,
our algorithm iterates as follows:

1. First, the set of all extreme points of polyhedron ΩΘ is generated. This set
is denoted by EPΘ and its kth element is denoted by ω^k.
2. Then, for every point ω^k ∈ EPΘ, P is solved considering the precise scalarizing
function fω^k (the corresponding optimal solution is denoted by x^k).
3. Finally, MMR(XΘ, ΩΘ) is computed, where XΘ = {x^k : k ∈ {1, . . . , |EPΘ|}}.
If this value is strictly larger than δ, then the DM is asked to compare two
solutions x, x′ ∈ XΘ, and ΩΘ is updated by imposing the linear constraint
fω(x) ≤ fω(x′) (or fω(x) ≥ fω(x′), depending on her answer); the algorithm
stops otherwise.
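In general, step 1 requires a vertex enumeration algorithm for the polyhedron ΩΘ. In the biobjective weighted-sum case, however, ΩΘ reduces to an interval for ω1 whose two endpoints are the extreme points, which can be maintained directly. A sketch under that assumption (the encoding of preference statements is ours):

```python
def extreme_points_2d(statements):
    """Extreme points of Ω_Θ in the biobjective weighted-sum case, where
    ω = (w, 1 - w) with 0 ≤ w ≤ 1.  Each statement (u, v), meaning u is
    preferred to v, imposes f_ω(u) ≤ f_ω(v), i.e.
    w*(u1 - v1) + (1 - w)*(u2 - v2) ≤ 0, a one-sided bound on w;
    Ω_Θ is then an interval [lo, hi] with two extreme points."""
    lo, hi = 0.0, 1.0
    for u, v in statements:
        a = (u[0] - v[0]) - (u[1] - v[1])   # coefficient of w
        b = u[1] - v[1]                      # constraint: a*w + b ≤ 0
        if a > 0:
            hi = min(hi, -b / a)
        elif a < 0:
            lo = max(lo, -b / a)
        elif b > 0:
            return []                        # inconsistent statements
    if lo > hi:
        return []                            # inconsistent statements
    return [(lo, 1 - lo), (hi, 1 - hi)]

# "u = (3, 5) preferred to v = (4, 4)": 3w + 5(1-w) ≤ 4w + 4(1-w) ⟺ w ≥ 0.5
print(extreme_points_2d([((3, 5), (4, 4))]))   # → [(0.5, 0.5), (1.0, 0.0)]
```

For n > 2 objectives, a general vertex enumeration routine (e.g. double description or reverse search) would play the role of this function.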

Our algorithm, called IEEP (for Incremental Elicitation based on Extreme
Points), is summarized in Algorithm 1. The implementation details of the Select,
Optimizing and ExtremePoints procedures are given in the numerical section.
Note however that Optimizing is a procedure that depends on the optimization
problem (e.g., the Prim algorithm could be used for the spanning tree problem). The
following proposition establishes the validity of our interactive method:

Proposition 1. For any positive tolerance threshold δ, algorithm IEEP returns a solution x∗ ∈ X such that the inequality MR(x∗, X, ΩΘ) ≤ δ holds.

Proof. Let x∗ be the returned solution and let K be the number of extreme points of ΩΘ at the end of the execution. For all k ∈ {1, . . . , K}, let ωk be the kth extreme point of ΩΘ and let xk be a solution minimizing function fωk. Let XΘ = {xk : k ∈ {1, . . . , K}}. We know that MR(x∗, XΘ, ΩΘ) ≤ δ holds at the end of the while loop (see the loop condition); hence we have fω(x∗) − fω(xk) ≤ δ for all solutions xk ∈ XΘ and all parameters ω ∈ ΩΘ (see Definition 2).
We want to prove that MR(x∗, X, ΩΘ) ≤ δ holds at the end of execution. To do so, it is sufficient to prove that fω(x∗) − fω(x) ≤ δ holds for all x ∈ X and all ω ∈ ΩΘ. Since ΩΘ is a convex polyhedron, for any ω ∈ ΩΘ there exists a vector λ = (λ1, . . . , λK) ∈ [0, 1]^K such that ∑_{k=1}^{K} λk = 1 and ω = ∑_{k=1}^{K} λk ωk.
Therefore, for all solutions x ∈ X and for all parameters ω ∈ ΩΘ , we have:
fω(x∗) − fω(x) = ∑_{k=1}^{K} λk (fωk(x∗) − fωk(x))      by linearity
               ≤ ∑_{k=1}^{K} λk (fωk(x∗) − fωk(xk))     since xk is fωk-optimal
               ≤ ∑_{k=1}^{K} λk · δ                     since fωk(x∗) − fωk(xk) ≤ δ
               = δ · ∑_{k=1}^{K} λk
               = δ.  □


For illustration purposes, we now present the execution of our algorithm on a small instance of the multicriteria spanning tree problem.

Algorithm 1. IEEP
IN ↓ P: a MOCO problem; δ: a threshold; fω: a scalarizing function with unknown parameters ω; Θ: a set of preference statements.
OUT ↑: a solution x∗ with a max regret smaller than δ.

--| Initialization of the convex polyhedron:
ΩΘ ← {ω : ∀(u, v) ∈ Θ, fω(u) ≤ fω(v)}
--| Generation of the extreme points of the polyhedron:
EPΘ ← ExtremePoints(ΩΘ)
--| Generation of the optimal solutions attached to EPΘ:
XΘ ← Optimizing(P, EPΘ)
while MMR(XΘ, ΩΘ) > δ do
    --| Selection of two solutions to compare:
    (x, x′) ← Select(XΘ)
    --| Question:
    query(x, x′)
    --| Update preference information:
    if x is preferred to x′ then
        Θ ← Θ ∪ {(y(x), y(x′))}
    else
        Θ ← Θ ∪ {(y(x′), y(x))}
    end
    ΩΘ ← {ω : ∀(u, v) ∈ Θ, fω(u) ≤ fω(v)}
    --| Generation of the extreme points of the polyhedron:
    EPΘ ← ExtremePoints(ΩΘ)
    --| Generation of the optimal solutions attached to EPΘ:
    XΘ ← Optimizing(P, EPΘ)
end
return a solution x∗ ∈ XΘ minimizing MR(x, XΘ, ΩΘ)
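To make the loop concrete, here is a minimal Python sketch of the IEEP iteration for the special case of two objectives and a weighted-sum model, where ΩΘ reduces to an interval for the first weight and its two endpoints are its extreme points. The function names, the simulated DM and the restriction to a fixed finite solution set are our own simplifications for illustration; the authors' implementation is written in C and relies on polymake.

```python
def ws(w1, y):
    """Weighted sum f_w(y) with weight vector (w1, 1 - w1)."""
    return w1 * y[0] + (1 - w1) * y[1]

def ieep_2obj(solutions, dm_weight, delta=0.0):
    """IEEP sketch for n = 2 objectives (minimization): Omega_Theta is
    an interval [lo, hi] for w1, whose endpoints are its extreme points."""
    lo, hi = 0.0, 1.0
    for _ in range(100):                        # safety bound for this sketch
        # one f_w-optimal solution per extreme point (Optimizing step)
        X = sorted({min(solutions, key=lambda y: ws(w, y)) for w in (lo, hi)})
        # max regret of x over X, evaluated at the extreme points only
        def mr(x):
            return max(ws(w, x) - min(ws(w, y) for y in X) for w in (lo, hi))
        best = min(X, key=mr)
        if mr(best) <= delta or len(X) == 1:    # MMR(X, Omega) <= delta
            return best
        # ask the simulated DM to compare two solutions and cut the interval
        x, y = X[0], X[1]
        if ws(dm_weight, x) > ws(dm_weight, y):
            x, y = y, x                         # now x is the preferred one
        # f_w(x) <= f_w(y)  <=>  w1 * a <= b, with:
        a = x[0] - x[1] - y[0] + y[1]
        b = y[1] - x[1]
        if a > 0:
            hi = min(hi, b / a)
        elif a < 0:
            lo = max(lo, b / a)
    return best
```

As in the experiments of Sect. 5, the DM's answers are simulated here by a hidden weight vector that the algorithm never reads directly.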

Example 1. Consider the multicriteria spanning tree problem with 5 nodes and
7 edges given in Fig. 1. Each edge is evaluated with respect to 3 criteria. Assume
that the DM’s preferences can be represented by a weighted sum fω with
unknown parameters ω. Our goal is to determine an optimal spanning tree for the
DM (δ = 0), i.e. a connected acyclic sub-graph with 5 nodes that is fω -optimal.
We now apply algorithm IEEP on this instance, starting with an empty set of
preference statements (i.e. Θ = ∅).
Initialization: As Θ = ∅, ΩΘ is initialized to the set of all weighting vectors
ω = (ω1 , ω2 , ω3 ) ∈ [0, 1]3 such that ω1 + ω2 + ω3 = 1. In Fig. 2, ΩΘ is represented
by the triangle ABC in the space (ω1 , ω2 ); value ω3 is implicitly defined by
ω3 = 1 − ω1 − ω2 . Hence the initial extreme points are the vectors of the natural
basis of the Euclidean space, corresponding to Pareto dominance [29]; in other
words, we have EPΘ = {ω 1 , ω 2 , ω 3 } with ω 1 = (1, 0, 0), ω 2 = (0, 1, 0) and
ω 3 = (0, 0, 1). We then optimize according to all weighting vectors in EPΘ using
Prim algorithm [22], and we obtain the following three solutions: for ω 1 , we
have a spanning tree x1 evaluated by y(x1 ) = (15, 17, 14); for ω 2 , we obtain a
spanning tree x2 with y(x2 ) = (23, 8, 16); for ω 3 , we find a spanning tree x3 such
that y(x3 ) = (17, 16, 11). Hence we have XΘ = {x1 , x2 , x3 }.

[Figure 1 shows a graph on vertices 1–5 with seven edges, each labeled with a three-dimensional cost vector: (8,1,1), (3,4,7), (7,7,7), (2,2,2), (4,9,1), (7,3,9) and (6,2,4).]
Fig. 1. A three-criteria minimum spanning tree problem.

Iteration Step 1: Since MMR(XΘ, ΩΘ) = 8 > δ = 0, we ask the DM to compare two solutions in XΘ, say x1 and x2. Assume that the DM prefers x2.
In that case, we perform the following updates: Θ = {((23, 8, 16), (15, 17, 14))}
and ΩΘ = {ω : fω (23, 8, 16) ≤ fω (15, 17, 14)}; in Fig. 3, ΩΘ is represented by
triangle BFE. We then compute the set EPΘ of its extreme points (by applying
the algorithm in [10] for example) and we obtain EPΘ = {ω 1 , ω 2 , ω 3 } with ω 1 =
(0.53, 0.47, 0), ω 2 = (0, 0.18, 0.82) and ω 3 = (0, 1, 0). We optimize according
to these weights and we obtain three spanning trees: XΘ = {x1 , x2 , x3 } with
y(x1 ) = (23, 8, 16), y(x2 ) = (17, 16, 11) and y(x3 ) = (19, 9, 14).
Iteration Step 2: Here MMR(XΘ, ΩΘ) = 1.18 > δ = 0. Therefore, we ask
the DM to compare two solutions in XΘ , say x1 and x2 . Assume she prefers x2 .
We then obtain Θ = {((23, 8, 16), (15, 17, 14)), ((17, 16, 11), (23, 8, 16))} and we
set ΩΘ = {ω : fω (23, 8, 16) ≤ fω (15, 17, 14) ∧ fω (17, 16, 11) ≤ fω (23, 8, 16)}.
We compute the corresponding extreme points which are given by EPΘ =
{(0.43, 0.42, 0.15), (0, 0.18, 0.82), (0, 0.38, 0.62)} (see triangle HGE in Fig. 4);
finally we have XΘ = {x1 , x2 } with y(x1 ) = (17, 16, 11) and y(x2 ) = (19, 9, 14).
Iteration Step 3: Now MMR(XΘ, ΩΘ) = 1.18 > δ = 0. Therefore
we ask the DM to compare x1 and x2 . Assuming that she prefers x2 , we
update Θ by inserting the preference statement ((19, 9, 14), (17, 16, 11)) and
we update ΩΘ by imposing the following additional constraint: fω (19, 9, 14) ≤
fω (17, 16, 11) (see Fig. 5); the corresponding extreme points are given by EPΘ =
{(0.18, 0.28, 0.54), (0, 0.3, 0.7), (0, 0.38, 0.62), (0.43, 0.42, 0.15)}. Now the set XΘ
only includes one spanning tree x1 and y(x1 ) = (19, 9, 14). Finally, the algorithm
stops (since we have MMR(XΘ, ΩΘ) = 0 ≤ δ = 0) and it returns solution x1
(which is guaranteed to be the optimal solution for the DM).

[Figures 2–5 plot the weight set ΩΘ in the (ω1, ω2) space: the initial triangle ABC shrinks to triangle BFE after step 1, to triangle HGE after step 2, and to a smaller quadrilateral after step 3.]
Fig. 2. Initial set. Fig. 3. After step 1. Fig. 4. After step 2. Fig. 5. After step 3.

5 Experimental Results

We now provide numerical results aiming to evaluate the performance of our interactive approach. At each iteration step of our procedure, the DM is asked to compare two solutions selected from the set XΘ until MMR(XΘ, ΩΘ) ≤ δ. Therefore, we need to estimate the impact of the Select procedure on the performance of our algorithm. Here we consider the following query generation strategies:

– Random: The two solutions are randomly chosen in XΘ.
– Max-Dist: We compute the Euclidean distance between all solutions in the objective space and we choose a pair of solutions maximizing the distance.
– CSS: The Current Solution Strategy (CSS) consists in selecting a solution that minimizes the max regret and one of its adversary's choices [7]¹.
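As a small illustration of the Max-Dist rule, the following sketch (the function name is ours, not from the paper) returns the pair of objective vectors that are farthest apart in Euclidean distance:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python >= 3.8)

def max_dist_pair(vectors):
    """Max-Dist query strategy: pick the two objective vectors that
    are farthest apart in the objective space."""
    return max(combinations(vectors, 2), key=lambda p: dist(p[0], p[1]))
```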

These strategies are compared using the following indicators:

– time: The running time, given in seconds.
– eval: The number of evaluations, i.e., the number of problems with known preferences that are solved during the execution; recall that we solve one optimization problem per extreme point at each iteration step (see Optimizing).
– queries: The number of preference queries generated during the execution.
– qOpt: The number of preference queries generated until the determination of the preferred solution (but not yet proved optimal).
¹ Note that these three strategies are equivalent when only considering two objectives, since the number of extreme points is always equal to two in this particular case.
We assume here that the DM’s preferences can be represented by a weighted


sum fω but the weights ω = (ω1, . . . , ωn) are not known initially. More precisely, we start the execution with an empty set of preference statements (i.e., Θ = ∅ and ΩΘ = {ω ∈ R^n_+ : ∑_{i=1}^{n} ωi = 1}), and then any new preference statement (u, v) ∈ R^{2n} obtained from the DM induces the following linear constraint over the weights: ∑_{i=1}^{n} ωi ui ≤ ∑_{i=1}^{n} ωi vi. Hence ΩΘ is a convex polyhedron. In our
experiments, the answers to queries are simulated using a weighting vector ω
randomly generated before running the algorithm, using the procedure presented
in [25], to guarantee a uniform distribution of the weights.
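For illustration, one classical recipe for drawing such weights uniformly on the simplex uses the spacings of sorted uniform variates. This sketch is of the kind described in [25]; we do not claim it is the exact variant used by the authors, and the function name is ours:

```python
import random

def uniform_simplex_weights(n, rng=random):
    """Draw a weight vector uniformly from {w in R^n_+ : sum(w) = 1}
    via the spacings of n - 1 sorted uniform variates."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    points = [0.0] + cuts + [1.0]
    return [points[i + 1] - points[i] for i in range(n)]
```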

Implementation Details. Numerical tests were performed on an Intel Core i7-7700, at 3.60 GHz, with a program written in C. At each iteration step of our algorithm, the extreme points associated with the convex polyhedron ΩΘ are generated using the polymake library². Moreover, at each step, we do not compute
PMR values using a linear programming solver. Instead, we only compute score
differences since the maximum value is always obtained for an extreme point of
the convex polyhedron. Furthermore, to reduce the number of PMR computa-
tions, we use Pareto dominance tests between the extreme points to eliminate
dominated solutions, as proposed in [20].
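The two shortcuts above (regret maxima attained at the extreme points of a convex polyhedron, and Pareto-dominance pre-filtering) can be sketched as follows for a weighted-sum model; the function names are ours:

```python
def pareto_dominates(u, v):
    """u Pareto-dominates v (minimization): u <= v componentwise, u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def pareto_filter(vectors):
    """Keep only the Pareto-nondominated objective vectors."""
    return [u for u in vectors if not any(pareto_dominates(v, u) for v in vectors)]

def mmr(vectors, extreme_points):
    """Minimax regret over objective vectors, f_w = weighted sum.
    PMR(x, y) = max_w f_w(x) - f_w(y) is linear in w, so its maximum
    over the polyhedron is attained at one of its extreme points."""
    def f(w, y):
        return sum(wi * yi for wi, yi in zip(w, y))
    def mr(x):
        return max(f(w, x) - min(f(w, y) for y in vectors) for w in extreme_points)
    return min(mr(x) for x in vectors)
```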

5.1 Multicriteria Spanning Tree

In these experiments, we consider instances of the multicriteria spanning tree (MST) problem, which is defined by a connected graph G = (V, E) where each
edge e ∈ E is valued by a cost vector giving its cost with respect to different
criteria/objectives (every criterion is assumed to be additive over the edges). A
spanning tree of G is a connected sub-graph of G which includes every vertex
v ∈ V while containing no cycle. In this problem, X is the set of all spanning
trees of G. We generate instances of G = (V, E) with a number of vertices |V |
varying between 50 and 100 and a number of objectives n ranging from 2 to 6.
The edge costs are drawn within {1, . . . , 1000}n uniformly at random. For the
MST problem, procedure Optimizing(P, EPΘ ) proceeds as follows: First, for all
extreme points ω k ∈ EPΘ , an instance of the spanning tree problem with a single
objective is created by simply aggregating the edge costs of G using weights ω k .
Then, Prim algorithm is applied on the resulting graphs. The results obtained
by averaging over 30 runs are given in Table 1 for δ = 0.
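The Optimizing step just described can be sketched as follows: aggregate each edge's cost vector with the weight vector ωk, then run Prim algorithm on the resulting single-objective graph. The code below is an illustrative stand-in (the toy graph in the test is not the instance of Fig. 1):

```python
import heapq

def prim_aggregated(n_vertices, edges, w):
    """Prim's algorithm on the single-objective graph obtained by
    aggregating each multi-dimensional edge cost with weight vector w.
    edges: list of (u, v, cost_vector) with vertices 0..n_vertices-1."""
    agg = lambda c: sum(wi * ci for wi, ci in zip(w, c))
    adj = {u: [] for u in range(n_vertices)}
    for u, v, c in edges:
        adj[u].append((agg(c), v, (u, v)))
        adj[v].append((agg(c), u, (u, v)))
    visited = {0}
    heap = list(adj[0])
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < n_vertices:
        cost, v, e = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        tree.append(e)
        for item in adj[v]:
            if item[1] not in visited:
                heapq.heappush(heap, item)
    return tree
```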

² https://polymake.org.
Table 1. MST: comparison of the different query strategies (best values in bold).

n |V |  IEEP - Random               IEEP - Max-Dist             IEEP - CSS
        time(s) queries eval qOpt   time(s) queries eval qOpt   time(s) queries eval qOpt
2 50 8.6 7.4 9.4 4.6 8.0 7.4 9.4 4.6 7.7 7.4 9.4 4.6
3 50 16.9 16.2 34.9 10.9 16.5 15.2 33.1 10.2 17.9 16.9 35.9 12.0
4 50 27.5 25.7 117.3 19.7 26.4 24.6 112.3 17.2 30.7 28.9 130.8 20.1
5 50 37.7 35.0 363.2 27.2 36.2 34.3 358.4 23.3 42.3 39.8 404.7 30.6
6 50 46.1 43.3 1056.3 35.3 45.5 42.7 1075.2 32.8 62.6 57.6 1537.9 43.6
2 100 10.0 8.6 10.6 5.7 8.9 8.6 10.6 5.7 9.2 8.6 10.6 5.7
3 100 18.7 17.6 37.8 14.0 19.0 17.4 37.2 13.9 19.0 17.7 37.7 13.0
4 100 32.0 29.9 134.0 23.3 30.1 28.4 129.9 22.3 34.8 32.5 147.0 24.1
5 100 41.8 39.8 404.4 31.3 42.1 39.2 411.5 31.0 55.9 51.7 564.8 40.6
6 100 55.9 51.5 1306.1 40.0 52.3 49.1 1259.3 38.7 84.0 75.7 2329.6 62.1

Running Time and Number of Evaluations. We observe that the Random and Max-Dist strategies are much faster than the CSS strategy; for instance, for n = 6 and |V| = 100, Random and Max-Dist end in under a minute, whereas CSS needs almost a minute and a half. Note that time is mostly consumed by the generation of extreme points, given that the evaluations are performed by Prim algorithm, which is very efficient. Since the number of evaluations with CSS drastically increases with the size of the problem, we may expect the performance gap between CSS and the two other strategies to be much larger for MOCO problems with a less efficient solving method.
Number of Generated Preference Queries. We can see that Max-Dist is the best strategy for minimizing the number of generated preference queries. More precisely, for all instances, the preferred solution is detected with fewer than 40 queries and optimality is established after at most 50 queries. In fact, we can reduce the number of preference queries even further by considering a strictly positive tolerance threshold; to give an example, if we set δ = 0.1 (i.e., 10% of the "maximum" error computed using the ideal point and the worst objective vector), then our algorithm combined with the Max-Dist strategy generates at most 20 queries in all considered instances. In Table 1, we also observe that the CSS strategy generates many more queries than Random, which is quite surprising since the CSS strategy is intensively used in incremental elicitation (e.g., [4,7]). To better understand
number of queries for the bigger instance of our set (|V | = 100, n = 6). We have
divided the figure in two parts: the first part is when the number of queries is
between 1 and 20 and the other part is when the number of queries is between
20 and 50 (see Fig. 6). In the first figure, we observe that there is almost no
difference between the three strategies, and the minimax regret is already close
to 0 after only 20 questions (showing that we are very close to the optimum
relatively quickly). However, there is a significant difference between the three
strategies in the second figure: the minimax regret with CSS starts to reduce
Fig. 6. MST problem with n = 6 and |V | = 100: evolution of the minimax regret
between 1 and 20 queries (left) and between 21 and 50 queries (right).

less quickly after 30 queries, remaining strictly positive after 50 queries, whereas the optimal solution is found after about 40 queries with the other strategies. Thus, the queries generated with CSS gradually become less informative than those generated by the two other strategies. This can be explained as follows: CSS always selects the minimax regret optimal solution and one of its worst adversaries. Therefore, when the minimax regret optimal solution does not change after asking a query, the same solution is used for the next preference query. This can be less informative than asking the DM to compare two solutions for which we have no preference information at all; the Random and Max-Dist strategies select the two solutions to compare in a more diverse way.

Comparison with the State-of-the-Art Method. In this subsection, we compare our interactive method with the state-of-the-art method proposed in [2].
The latter consists essentially in integrating incremental elicitation into Prim
algorithm [22]; therefore, this method will be called IE-Prim hereafter. The main
difference between IE-Prim and IEEP is that IE-Prim is constructive: queries
are not asked on complete solutions but on partial solutions (edges of the graph).
We have implemented IE-Prim ourselves, using the same programming language and data structures as IEEP, in order to allow a fair comparison between these methods. Although IE-Prim was only proposed and tested with CSS in [2], we have integrated the two other strategies (i.e., Max-Dist and Random) into IE-Prim. In Table 2, we compare IEEP with Max-Dist and IE-Prim in terms of running times and number of queries³. We see that IEEP outperforms IE-Prim in all settings, allowing the running time and the number of queries to be divided by three in our biggest instances. Note that the Max-Dist and Random strategies improve the performance of IE-Prim (compared to CSS), but this is still not enough to achieve results comparable to IEEP. This shows that asking queries during the
³ Note that we cannot compute qOpt and eval for IE-Prim, since it is constructive and makes no evaluation.
Table 2. MST: comparison between IEEP and IE-Prim (best values in bold).

n |V |  IEEP - Max-Dist   IE-Prim - Random   IE-Prim - Max-Dist   IE-Prim - CSS
        time(s) queries   time(s) queries    time(s) queries      time(s) queries
2 50 8.0 7.4 13.3 12.3 12.1 11.2 13.0 12.3
3 50 16.5 15.2 28.6 26.7 26.1 24.5 31.9 29.6
4 50 26.4 24.6 45.0 42.1 42.5 39.7 55.6 50.8
5 50 36.2 34.3 59.7 55.5 56.9 53.2 80.4 73.4
6 50 45.5 42.7 78.7 73.4 79.4 73.5 117.8 108.1
2 100 8.9 8.6 15.9 15.1 14.6 13.6 16.1 15.0
3 100 19.0 17.4 34.6 32.4 33.6 31.1 36.9 35.3
4 100 30.1 28.4 55.6 51.6 54.7 51.2 66.6 61.6
5 100 42.1 39.2 75.4 70.7 76.4 71.7 103.7 95.3
6 100 52.3 49.1 103.7 96.0 100.3 93.5 162.3 146.2

construction of the solutions is less informative than asking queries using the
extreme points of the polyhedron representing the preference uncertainty.
Now we want to estimate the performance of our algorithm seen as an anytime algorithm (see Fig. 7). For each iteration step i, we compute the error obtained when deciding to return the solution that is optimal for the minimax regret criterion at step i (i.e., after i queries); this error is expressed here as a percentage deviation from the optimal solution. For the sake of comparison, we also include the results obtained with IE-Prim. However, IE-Prim cannot be seen as an anytime algorithm since it is constructive. Therefore, to vary the number of queries, we used different tolerance thresholds: δ = 0.3, 0.2, 0.1, 0.05 and 0.01.

Fig. 7. MST problem with |V | = 100: Comparison of the errors with respect to the
number of queries for n = 3 (left) and for n = 6 (right).

In Fig. 7, we observe that the error drops relatively quickly for both procedures. Note however that the error obtained with IE-Prim is smaller than with IEEP when the number of queries is very low. This may suggest favoring IE-Prim over IEEP whenever the interactions are very limited and time is not an issue.

5.2 Multicriteria Traveling Salesman Problem

We now provide numerical results for the multicriteria traveling salesman problem (MTSP). In our tests, we consider existing Euclidean instances of the MTSP with 50 and 100 cities, and n = 2 to 6 objectives⁴. Moreover, we use the exact solver Concorde⁵ to perform the optimization part of the IEEP algorithm (see the Optimizing procedure). In contrast to the MST, there exists no interactive constructive algorithm to solve the MTSP. Therefore, we only provide the results obtained by our algorithm IEEP with the three proposed query generation strategies (namely Random, Max-Dist and CSS). The results obtained by averaging over 30 runs are given in Table 3 for δ = 0.
In this table, we see that Max-Dist remains the best strategy for minimizing
the number of generated preference queries. Note that the running times are
much higher for the MTSP than for the MST (see Table 1), as the traveling
salesman problem is much more difficult to solve exactly with known preferences.

Table 3. MTSP: comparison of the different query strategies (best values in bold)

n |V |  IEEP - Random               IEEP - Max-Dist             IEEP - CSS
        time(s) queries eval qOpt   time(s) queries eval qOpt   time(s) queries eval qOpt
2 50 8.0 6.3 8.3 3.7 8.8 6.3 8.3 3.7 10.0 6.3 8.3 3.7
3 50 21.2 14.3 31.3 10.0 23.5 13.3 29.5 9.5 24.5 14.9 32.4 10.6
4 50 38.7 22.6 101.5 16.0 50.2 20.7 93.6 16.2 67.7 24.2 109.1 16.9
5 50 210.9 31.2 331.1 22.7 95.1 28.6 304.8 19.2 137.1 38.5 387.7 23.9
6 50 390.8 41.0 1044.5 26.2 238.8 37.3 949.3 24.4 584.9 58.4 1531.0 28.9
2 100 12.2 7.6 9.6 4.3 11.3 7.6 9.6 4.3 19.1 7.6 9.6 4.3
3 100 28.3 15.9 34.7 12.4 27.3 15.4 33.7 12.1 42.2 16.5 35.6 11.7
4 100 73.1 26.7 121.1 20.0 69.9 25.4 115.8 18.1 94.8 28.4 124.9 19.6
5 100 241.9 36.4 380.6 27.3 237.0 35.5 383.0 24.4 361.8 44.7 481.3 31.2
6 100 981.2 45.0 1106.8 32.8 586.3 41.7 1014.5 30.2 1618.3 68.8 1865.3 39.2

6 Conclusion and Perspectives

In this paper, we have proposed a general method for solving multi-objective combinatorial optimization problems with unknown preference parameters. The method is based on a sharp combination of (1) regret-based incremental preference elicitation and (2) the generation of promising solutions using the extreme

⁴ https://eden.dei.uc.pt/~paquete/tsp/.
⁵ http://www.math.uwaterloo.ca/tsp/concorde.
points of the polyhedron representing the admissible preference parameters; several query generation strategies have been proposed in order to improve its performance. We have shown that our method returns the optimal solution according to the DM's preferences. Our method has been tested on the multicriteria spanning tree and multicriteria traveling salesman problems with up to 6 criteria and 100 vertices. We have provided numerical results showing that our method achieves better results than IE-Prim (the state-of-the-art method for the MST problem), both in terms of number of preference queries and running times. Thus, in practice, our algorithm outperforms IE-Prim, which is an algorithm that runs in polynomial time and generates no more than a polynomial number of queries. However, our algorithm does not have these performance guarantees.
More precisely, the performance of our interactive method strongly depends on the number of extreme points at each iteration step, which can be exponential in the number of criteria (see, e.g., [26]). Therefore, the next step could be to identify an approximate representation of the polyhedron which guarantees that the number of extreme points is always polynomial, while still being able to determine a (near-)optimal solution according to the DM's preferences.

References
1. Benabbou, N., Di Sabatino Di Diodoro, S., Perny, P., Viappiani, P.: Incremental
preference elicitation in multi-attribute domains for choice and ranking with the
Borda count. In: Schockaert, S., Senellart, P. (eds.) SUM 2016. LNCS (LNAI),
vol. 9858, pp. 81–95. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45856-4_6
2. Benabbou, N., Perny, P.: On possibly optimal tradeoffs in multicriteria spanning
tree problems. In: Walsh, T. (ed.) ADT 2015. LNCS (LNAI), vol. 9346, pp. 322–
337. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23114-3_20
3. Benabbou, N., Perny, P.: Solving multi-agent knapsack problems using incremental
approval voting. In: Proceedings of ECAI 2016, pp. 1318–1326 (2016)
4. Benabbou, N., Perny, P.: Interactive resolution of multiobjective combinatorial
optimization problems by incremental elicitation of criteria weights. EURO J.
Decis. Process. 6(3–4), 283–319 (2018)
5. Benabbou, N., Perny, P., Viappiani, P.: Incremental elicitation of Choquet capaci-
ties for multicriteria choice, ranking and sorting problems. Artif. Intell. 246, 152–
180 (2017)
6. Bourdache, N., Perny, P.: Active preference elicitation based on generalized Gini
functions: application to the multiagent knapsack problem. In: Proceedings of
AAAI 2019 (2019)
7. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based opti-
mization and utility elicitation using the minimax decision criterion. Artif. Intell.
170(8–9), 686–713 (2006)
8. Choquet, G.: Theory of capacities. Annales de l’Institut Fourier 5, 31–295 (1953)
9. Drummond, J., Boutilier, C.: Preference elicitation and interview minimization in
stable matchings. In: Proceedings of AAAI 2014, pp. 645–653 (2014)
10. Dyer, M., Proll, L.: An algorithm for determining all extreme points of a convex polytope. Math. Program. 12(1), 81–96 (1977)
11. Gilbert, H., Spanjaard, O., Viappiani, P., Weng, P.: Reducing the number of queries
in interactive value iteration. In: Walsh, T. (ed.) ADT 2015. LNCS (LNAI), vol.
9346, pp. 139–152. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23114-3_9
12. Grabisch, M., Labreuche, C.: A decade of application of the Choquet and Sugeno
integrals in multi-criteria decision aid. Ann. Oper. Res. 175(1), 247–286 (2010)
13. Hamacher, H., Ruhe, G.: On spanning tree problems with multiple objectives. Ann.
Oper. Res. 52, 209–230 (1994)
14. Kaddani, S., Vanderpooten, D., Vanpeperstraete, J.M., Aissi, H.: Weighted sum
model with partial preference information: application to multi-objective optimiza-
tion. Eur. J. Oper. Res. 260, 665–679 (2017)
15. Karasakal, E., Köksalan, M.: Generating a representative subset of the nondomi-
nated frontier in multiple criteria. Oper. Res. 57(1), 187–199 (2009)
16. Korhonen, P.: Interactive methods. In: Figueira, J., Greco, S., Ehrogott, M. (eds.)
Multiple Criteria Decision Analysis: State of the Art Surveys. ISOR, vol. 78, pp.
641–661. Springer, New York (2005). https://doi.org/10.1007/0-387-23081-5_16
17. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling sales-
man problem. Proc. Am. Math. Soc. 7, 48–50 (1956)
18. Laumanns, M., Thiele, L., Deb, K., Zitzler, E.: Combining convergence and diver-
sity in evolutionary multiobjective optimization. Evol. Comput. 10(3), 263–282
(2002)
19. Lust, T., Rolland, A.: Choquet optimal set in biobjective combinatorial optimiza-
tion. Comput. OR 40(10), 2260–2269 (2013)
20. Marinescu, R., Razak, A., Wilson, N.: Multi-objective constraint optimization with
tradeoffs. In: Schulte, C. (ed.) CP 2013. LNCS, vol. 8124, pp. 497–512. Springer,
Heidelberg (2013). https://doi.org/10.1007/978-3-642-40627-0_38
21. Marinescu, R., Razak, A., Wilson, N.: Multi-objective influence diagrams with
possibly optimal policies. In: Proceedings of AAAI 2017, pp. 3783–3789 (2017)
22. Prim, R.C.: Shortest connection networks and some generalizations. Bell Syst.
Tech. J. 36, 1389–1401 (1957)
23. White III, C.C., Sage, A.P., Dozono, S.: A model of multiattribute decisionmaking
and trade-off weight determination under uncertainty. IEEE Trans. Syst. Man
Cybern. 14(2), 223–229 (1984)
24. Regan, K., Boutilier, C.: Eliciting additive reward functions for Markov decision
processes. In: Proceedings of IJCAI 2011, pp. 2159–2164 (2011)
25. Rubinstein, R.: Generating random vectors uniformly distributed inside and on
the surface of different regions. Eur. J. Oper. Res. 10(2), 205–209 (1982)
26. Schrijver, A.: Combinatorial Optimization - Polyhedra and Efficiency. Springer,
Heidelberg (2003)
27. Wang, T., Boutilier, C.: Incremental utility elicitation with the minimax regret decision criterion. In: Proceedings of IJCAI 2003, pp. 309–316 (2003)
28. Weng, P., Zanuttini, B.: Interactive value iteration for Markov decision processes
with unknown rewards. In: Proceedings of IJCAI 2013, pp. 2415–2421 (2013)
29. Wiecek, M.M.: Advances in cone-based preference modeling for decision making
with multiple criteria. Decis. Making Manuf. Serv. 1(1–2), 153–173 (2007)
30. Wilson, N., Razak, A., Marinescu, R.: Computing possibly optimal solutions for
multi-objective constraint optimisation with tradeoffs. In: Proceedings of IJCAI
2015, pp. 815–822 (2015)
Open-Mindedness of Gradual
Argumentation Semantics

Nico Potyka(B)

Institute of Cognitive Science, University of Osnabrück, Osnabrück, Germany


[email protected]

Abstract. Gradual argumentation frameworks allow modeling arguments and their relationships and have been applied to problems like
decision support and social media analysis. Semantics assign strength
values to arguments based on an initial belief and their relationships.
The final assignment should usually satisfy some common-sense prop-
erties. One property that may currently be missing in the literature is
Open-Mindedness. Intuitively, Open-Mindedness is the ability to move
away from the initial belief in an argument if sufficient evidence against
this belief is given by other arguments. We generalize and refine a pre-
viously introduced notion of Open-Mindedness and use this definition to
analyze nine gradual argumentation approaches from the literature.

Keywords: Gradual argumentation · Weighted argumentation · Semantical properties

1 Introduction
The basic idea of abstract argumentation is to study the acceptability of argu-
ments abstracted from their content, just based on their relationships [13]. While
arguments can only be accepted or rejected under classical semantics, gradual
argumentation semantics consider a more fine-grained scale between these two
extremes [3,6–8,10,16,20,22]. Arguments may have a base score that reflects a
degree of belief that the argument is accepted when considered independent of
all the other arguments. Semantics then assign strength values to all arguments
based on their relationships and the base score if provided.
Of course, strength values should not be assigned in an arbitrary manner, but
should satisfy some common-sense properties. Baroni, Rago and Toni recently
showed that 29 properties from the literature can be reduced to basically two
fundamental properties called Balance and Monotonicity [8] that we will discuss
later. Balance and Monotonicity already capture a great deal of what we should
expect from strength values of arguments, but they do not (and do not attempt
to) capture everything. One desideratum that may be missing in many applications
is Open-Mindedness. To illustrate the idea, suppose that we evaluate arguments
by strength values between 0 and 1, where 0 means that we fully reject and 1
means that we fully accept an argument. Then, as we increase the number of
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 236–249, 2019.
https://doi.org/10.1007/978-3-030-35514-2_18

Fig. 1. Argument attacked by N other arguments.

supporters of an argument while keeping everything else equal, we should expect


that its strength steadily approaches 1. Symmetrically, as we increase the number
of attackers of an argument, we should expect that its strength approaches 0.
To illustrate this, consider the graph in Fig. 1 that shows an argument A that is
initially accepted (base score 1), but has N attackers that are initially accepted
as well. For example, we could model a trial in law, where A corresponds to the
argument that we should find the accused not guilty because we do not want to
convict an innocent person. The N attackers correspond to pieces of evidence
without reasonable doubt. Then, as N grows, we should expect that the strength
of A goes to 0. Similarly, in medical diagnosis, it is reasonable to initially accept
that a patient with an infectious disease has a common cold because this is
usually the case. However, as the number of symptoms for a more serious disease
grows, we should be able to reject our default diagnosis at some point. Of course,
we should expect a dual behaviour for support relations: if we initially reject A
and have N supporters that are initially accepted, we should expect that the
strength of A goes to 1 as N increases. A gradual argumentation approach that
respects this idea is called open-minded. Open-Mindedness may not be necessary
in every application, but it seems natural in many domains. Therefore, our goal
here is to investigate which gradual argumentation semantics from the literature
respect this property.

2 Compact QBAFs, Balance and Monotonicity


In our investigation, we consider quantitative bipolar argumentation frameworks
(QBAFs) similar to [8]. However, for now, we will restrict to frameworks that
assign values from a compact real interval to arguments in order to keep the
formalism simple. At the end of this article, we will explain how the idea can be
extended to more general QBAFs.
Definition 1 (Compact QBAF). Let D be a compact real interval. A QBAF
over D is a quadruple (A, Att, Sup, β) consisting of a set of arguments A, two
binary relations Att and Sup called attack and support and a function β : A → D
that assigns a base score β(a) to every argument a ∈ A.
Typical instantiations of the interval D are [0, 1] and [−1, 1]. Sometimes non-
compact intervals like open or unbounded intervals are considered as well, but
we exclude these cases for now. We can consider different subclasses of QBAFs
that use only some of the possible building blocks [8]. Among others, we will
look at subclasses that contain QBAFs of the following restricted forms:

Attack-only: (A, Att, Sup, β) where Sup = ∅,
Support-only: (A, Att, Sup, β) where Att = ∅,
Bipolar without Base Score: (A, Att, Sup, β) where β is a constant function.
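For concreteness, a compact QBAF and the three restricted forms above can be represented directly; this Python sketch is our own illustration, not part of the formalism:

```python
from dataclasses import dataclass

@dataclass
class QBAF:
    """Compact QBAF (A, Att, Sup, beta) over a real interval D, e.g. [0, 1]."""
    arguments: set
    att: set    # pairs (a, b) meaning: a attacks b
    sup: set    # pairs (a, b) meaning: a supports b
    base: dict  # base score beta(a) for every argument a

    def is_attack_only(self):
        return not self.sup

    def is_support_only(self):
        return not self.att

    def has_constant_base(self):
        # "bipolar without base score": beta is a constant function
        return len(set(self.base.values())) <= 1
```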

In order to interpret a given QBAF, we want to assign strength values to every


argument. The strength values should be connected in a reasonable way to the
base score of an argument and the strength of its attackers and supporters. Of
course, this can be done in many different ways. However, eventually we want a
function that assigns a strength value to every argument.
Definition 2 (QBAF interpretation). Let Q = (A, Att, Sup, β) be a QBAF
over a real interval D. An interpretation of Q is a function σ : A → D and σ(a)
is called the strength of a for all a ∈ A.
Gradual argumentation semantics can define interpretations for the whole class
of QBAFs or for a subclass only. One simple example is the h-categorizer seman-
tics from [10] that interprets only acyclic attack-only QBAFs without base score.
For all a ∈ A, the h-categorizer semantics defines σ(a) = 1 / (1 + Σ_{(b,a)∈Att} σ(b)). That
is, unattacked arguments have strength 1, and the strength of all other argu-
ments decreases monotonically based on the strength of their attackers. Since it
only interprets acyclic QBAFs, the strength values can be evaluated in topologi-
cal order, so that the strength values of all parents are known when interpreting
the next argument.
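To make the topological evaluation concrete, here is a minimal Python sketch (not code from the paper; the dictionary-based graph encoding and argument names are illustrative):

```python
# Illustrative sketch: the h-categorizer semantics on an acyclic attack-only
# QBAF without base scores. `attackers` maps each argument to the list of its
# attackers; the recursion realises the topological evaluation, since all
# parents are evaluated before the argument itself.

def h_categorizer(attackers):
    strength = {}

    def sigma(a):
        if a not in strength:
            strength[a] = 1.0 / (1.0 + sum(sigma(b) for b in attackers.get(a, [])))
        return strength[a]

    for a in attackers:
        sigma(a)
    return strength

# b and c are unattacked (strength 1 each); a is attacked by both,
# so its strength is 1/(1 + 1 + 1) = 1/3.
print(h_categorizer({"a": ["b", "c"], "b": [], "c": []})["a"])
```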
Of course, we do not want to assign final strength values in an arbitrary way.
Many desirable properties for different families of QBAFs have been proposed
in the literature, see, e.g., [2–4,16,22]. Dependent on whether base scores, only
attack, only support or both relations are considered, different properties have
been proposed. However, as shown in [8], most properties can be reduced to
basically two fundamental principles that are called Balance and Monotonicity.
Roughly speaking, Balance says that the strength of an argument should be
equal to its base score if its attackers and supporters are equally strong and that
it should be smaller (greater) if the attackers are stronger (weaker) than the
supporters. Monotonicity says, intuitively, that if the same impact (in terms of
base score, attack and support) acts on two arguments a1 , a2 , then they should
have the same strength, whereas if the impact on a1 is more positive, it should
have a larger strength than a2 . Several variants of Balance and Monotonicity
have been discussed in [8]. For example, the stronger-than relationship between
arguments can be formalized in a qualitative (focusing on the number of attackers
and supporters) or quantitative manner (focusing on the strength of attackers
and supporters). We refer to [8] for more details.

3 Open-Mindedness

Intuitively, it seems that Balance and Monotonicity could already imply Open-
Mindedness. After all, they demand that adding attacks (supports) increases
(decreases) the strength in a sense. However, this is not sufficient to guarantee

that the strength can be moved arbitrarily close to the boundary values. To
illustrate this, let us consider the Euler-based semantics that has been introduced
for the whole class of QBAFs in [4]. Strength values are defined by

σ(a) = 1 − (1 − β(a)²) / (1 + β(a) · exp(Σ_{(b,a)∈Sup} σ(b) − Σ_{(b,a)∈Att} σ(b)))

Note that if there are no attackers or supporters, the strength becomes just
1 − (1 + β(a))(1 − β(a)) / (1 + β(a) · 1) = β(a). If the strength of a's attackers accumulates to a
larger (smaller) value than the strength of a’s supporters, the strength will be
smaller (larger) than the base score. The Euler-based semantics satisfies the
basic Balance and Monotonicity properties in most cases, see [4] for more details.
However, it does not satisfy Open-Mindedness as has been noted in [21] already.
There are two reasons for this. The first reason is somewhat weak and regards
the boundary case β(a) = 0. In this case, the strength becomes 1 − (1 − 0²)/(1 + 0) = 0
independent of the supporters. In this boundary case, the Euler-based semantics
does not satisfy Balance and Monotonicity either. The second reason is more
profound and corresponds to the fact that the exponential function always yields
positive values. Therefore, 1 + β(a) · exp(x) ≥ 1 and σ(a) ≥ 1 − (1 − β(a)²)/1 = β(a)²
independent of the attackers. Hence, the strength value can never be smaller
than the base score squared. The reason that the Euler-based semantics can still
satisfy Balance and Monotonicity is that the limit β(a)² can never actually be
taken, but is only approximated as the number of attackers goes to infinity.
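The bound can be checked numerically with a short sketch (the helper function and the chosen values are illustrative, not from the paper):

```python
import math

# Euler-based strength of a single argument with base score beta, given the
# (fixed) strengths of its supporters and attackers.
def euler_strength(beta, supporters, attackers):
    return 1 - (1 - beta**2) / (1 + beta * math.exp(sum(supporters) - sum(attackers)))

# With beta = 0.8 and ever more attackers of maximal strength 1, the strength
# decreases but never drops below beta^2 = 0.64.
for n in (1, 10, 100):
    print(n, euler_strength(0.8, [], [1.0] * n))
```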
Hence, Open-Mindedness is indeed a property that is currently not captured
by Balance and Monotonicity. To begin with, we give a formal definition for a
restricted case. We assume that larger values in D are stronger to avoid tedious
case differentiations. This assumption is satisfied by the first eight semantics
that we consider. We will give a more general definition later that also makes
sense when this assumption is not satisfied. Open-Mindedness includes two dual
conditions, one for attack- and one for support-relations. Intuitively, we want
that in every QBAF, the strength of every argument with arbitrary base score
can be moved arbitrarily close to min(D) (max(D)) if we only add a sufficient
number of strong attackers (supporters). In the following definition, ε captures
the closeness and N the sufficiently large number.
Definition 3 (Open-Mindedness). Consider a semantics that defines an
interpretation σ : A → D for every QBAF from a particular class F of QBAFs
over a compact interval D. We call the semantics open-minded if for every QBAF
(A, Att, Sup, β) in F, for every argument a ∈ A and for every ε > 0, the following condition is satisfied: there is an N ∈ N such that when adding N new
arguments AN = {a1 , . . . , aN }, A ∩ AN = ∅, with maximum base score, then
1. if F allows attacks, then for (A ∪ AN, Att ∪ {(ai, a) | 1 ≤ i ≤ N}, Sup, β′), we have |σ(a) − min(D)| < ε, and
2. if F allows supports, then for (A ∪ AN, Att, Sup ∪ {(ai, a) | 1 ≤ i ≤ N}, β′), we have |σ(a) − max(D)| < ε,
where β′(b) = β(b) for all b ∈ A and β′(ai) = max(D) for i = 1, . . . , N.

Some explanations are in order. Note that we do not make any assumptions
about the base score of a in Definition 3. Hence, we demand that the strength
of a must become arbitrarily small (large) within the domain D, no matter what
its base score is. One may consider a weaker notion of Open-Mindedness that
excludes the boundary base scores for a. However, this distinction does not make
a difference for our investigation and so we will not consider it here. Note also
that we do not demand that the strength of a ever takes the extreme value
max(D) (min(D)), but only that it can become arbitrarily close. Finally note
that item 1 in Definition 3 is trivially satisfied for support-only QBAFs, and item
2 for attack-only QBAFs.

3.1 Attack-Only QBAFs over D = [0, 1]


In this section, we consider three semantics for attack-only QBAFs over D =
[0, 1]. Recall from Sect. 2 that the h-categorizer semantics from [10] inter-
prets acyclic attack-only QBAFs without base score. The definition has been
extended to arbitrary (including cycles) attack-only QBAFs and base scores from
D = [0, 1] in [6]. The strength of an argument under the weighted h-categorizer
semantics is then defined by
σ(a) = β(a) / (1 + Σ_{(b,a)∈Att} σ(b))    (1)
for all a ∈ A. Note that the original definition of the h-categorizer seman-
tics from [10] is obtained when all base scores are 1. The strength values in
(cyclic) graphs can be computed by initializing the strength values with the
base scores and applying formula (1) repeatedly to all arguments simultaneously
until the strength values converge [6]. It is not difficult to see that the weighted h-
categorizer semantics satisfies Open-Mindedness. However, in order to illustrate
our definition, we give a detailed proof of the claim.
Proposition 1. The weighted h-categorizer semantics is open-minded.
Proof. In the subclass of attack-only QBAFs, it suffices to check the first condition of Definition 3. Consider an arbitrary attack-only QBAF (A, Att, ∅, β), an arbitrary argument a ∈ A and an arbitrary ε > 0. Let N = ⌈1/ε⌉ + 1 and consider the QBAF (A ∪ {a1, . . . , aN}, Att ∪ {(ai, a) | 1 ≤ i ≤ N}, Sup, β′) as defined in Definition 3. Recall that the N new attackers {a1, . . . , aN} have base score 1 and do not have any attackers. Therefore, σ(ai) = β′(ai)/1 = 1 for i = 1, . . . , N and Σ_{(b,a)∈Att} σ(b) ≥ Σ_{i=1}^N σ(ai) = N. Furthermore, we have β(a) ≤ 1 because D = [0, 1]. Hence, |σ(a) − 0| = β(a) / (1 + Σ_{(b,a)∈Att} σ(b)) < 1/N < ε.
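The iterative computation and the open-minded behaviour can be sketched as follows (the dictionary encoding is ours, not the paper's):

```python
# Sketch of the fixed-point computation for Eq. (1): initialise all strengths
# with the base scores, then update every argument simultaneously until the
# values stabilise.

def weighted_h_categorizer(base, attackers, tol=1e-9):
    sigma = dict(base)
    while True:
        new = {a: base[a] / (1.0 + sum(sigma[b] for b in attackers.get(a, [])))
               for a in base}
        if max(abs(new[a] - sigma[a]) for a in base) < tol:
            return new
        sigma = new

# Open-Mindedness in action: n strong attackers push the strength of a
# (base score 1) down to 1/(1+n).
for n in (1, 10, 100):
    base = {"a": 1.0, **{f"x{i}": 1.0 for i in range(n)}}
    print(n, weighted_h_categorizer(base, {"a": [f"x{i}" for i in range(n)]})["a"])
```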

The weighted max-based semantics from [6] can be seen as a variant of the
h-categorizer semantics that aggregates the strength of attackers by means of
the maximum instead of the sum. The strength of arguments is defined by
σ(a) = β(a) / (1 + max_{(b,a)∈Att} σ(b)).    (2)

If there are no attackers, the maximum yields 0 by convention. The motivation


for using the maximum is to satisfy a property called Quality Precedence, which
guarantees that when arguments a1 and a2 have the same base score, but a1 has
an attacker that is stronger than all attackers of a2 , then the strength of a1 must
be smaller than the strength of a2 . The strength values under the weighted max-
based semantics can again be computed iteratively [6]. Since all strength values
are in [0, 1] and the maximum is used for aggregating the strength values, we can
immediately see that σ(a) ≥ β(a)/2. Therefore, the weighted max-based semantics
is clearly not open-minded. For example, if β(a) = 1, the final strength cannot
be smaller than 1/2, no matter how many attackers there are.
Proposition 2. The weighted max-based semantics is not open-minded.
One may wonder if Quality Precedence and Open-Mindedness are incompatible.
This is actually not the case. For example, when defining strength values by
 
σ(a) = β(a) · (1 − max_{(b,a)∈Att} σ(b))

both Quality Precedence and Open-Mindedness are satisfied. In particular, the


strength now decreases linearly from β(a) to 0 with respect to the strongest
attacker, which makes this perhaps a more natural way to satisfy Quality Prece-
dence when it is desired.
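The contrast between the two definitions fits in a two-line sketch (illustrative; the empty max is taken as 0, following the convention above):

```python
# Weighted max-based strength, Eq. (2), versus the linear variant proposed
# above; both for a single argument, given the strengths of its attackers.
def max_based(beta, attackers):
    return beta / (1.0 + max(attackers, default=0.0))

def linear_variant(beta, attackers):
    return beta * (1.0 - max(attackers, default=0.0))

print(max_based(1.0, [1.0]))       # 0.5: bounded below by beta/2
print(linear_variant(1.0, [1.0]))  # 0.0: reaches the boundary of D
```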
The weighted card-based semantics from [6] is another variant of the h-
categorizer semantics. Instead of putting extra emphasis on the strength of
attackers, it now puts extra emphasis on the number of attackers. Let Att⁺ =
{(a, b) ∈ Att | β(a) > 0}. Then the strength of arguments is defined by

σ(a) = β(a) / (1 + |Att⁺| + (Σ_{(b,a)∈Att⁺} σ(b)) / |Att⁺|).    (3)

When reordering terms in the denominator, we can see that the only difference to
the h-categorizer semantics is that every attacker b with non-zero strength adds
1 + σ(b)/|Att⁺| instead of just σ(b) to the sum in the denominator (attackers with strength
0 do not add anything anyway). This enforces a property called Cardinality
Precedence, which basically means that when arguments a1 and a2 have the
same base score and a1 has a larger number of non-rejected attackers (σ(b) > 0)
than a2 , then the strength of a1 must be smaller than the strength of a2 . The
strength values under the weighted card-based semantics can again be computed
iteratively [6]. Analogously to the weighted h-categorizer semantics, it can be
checked that the weighted card-based semantics satisfies Open-Mindedness.
Proposition 3. The weighted card-based semantics is open-minded.
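For illustration, the behaviour can be sketched as follows (the helper is ours and follows the reading of Eq. (3) above, where the founded attackers contribute their count plus their mean strength to the denominator):

```python
# Sketch of the weighted card-based strength for a single argument, given the
# strengths of its founded attackers.
def card_based(beta, founded_attackers):
    n = len(founded_attackers)
    if n == 0:
        return beta
    return beta / (1.0 + n + sum(founded_attackers) / n)

# The denominator grows with the number of attackers, so the strength
# tends to 0: 1/3, 1/12, 1/102.
for n in (1, 10, 100):
    print(n, card_based(1.0, [1.0] * n))
```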

3.2 Support-Only QBAFs over D = [0, 1]


We now consider three semantics for support-only QBAFs over D = [0, 1]. For
all semantics, the strength of arguments is defined by equations of the form
σ(a) = β(a) + (1 − β(a)) · S(a),
242 N. Potyka

where S(a) is an aggregate of the strength of a’s supporters. Therefore, the ques-
tion whether Open-Mindedness is satisfied boils down to the question whether
S(a) converges to 1 as we keep adding supporters.
The top-based semantics from [3] defines the strength of arguments by

σ(a) = β(a) + (1 − β(a)) · max_{(b,a)∈Sup} σ(b).    (4)

If there are no supporters, the maximum again yields 0 by convention. Similar


to the semantics in the previous section, the strength values can be computed
iteratively by setting the initial strength values to the base score and applying
formula (4) repeatedly until the values converge [3]. It is easy to check that the
top-based semantics is open-minded. In fact, a single supporter with strength
1 is sufficient to move the strength all the way to 1 independently of the base
score.
Proposition 4. The top-based semantics is open-minded.
The aggregation-based semantics from [3] defines the strength of arguments
by the formula

σ(a) = β(a) + (1 − β(a)) · (Σ_{(b,a)∈Sup} σ(b)) / (1 + Σ_{(b,a)∈Sup} σ(b)).    (5)

The strength values can again be computed iteratively [6]. It is easy to check
that the aggregation-based semantics is open-minded. Just note that the fraction
in (5) has the form N/(1 + N) and therefore approaches 1 as N → ∞. Therefore, the
strength of an argument will go to 1 as we keep adding supporters under the
aggregation-based semantics.
Proposition 5. The aggregation-based semantics is open-minded.
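The convergence towards 1 is easy to observe in a small sketch (the helper is illustrative):

```python
# Sketch of the aggregation-based strength, Eq. (5), for a single argument
# with fixed supporter strengths; the fraction S/(1+S) tends to 1.
def aggregation_based(beta, supporters):
    s = sum(supporters)
    return beta + (1.0 - beta) * s / (1.0 + s)

# Even with base score 0, the strength approaches 1: 1/2, 10/11, 100/101.
for n in (1, 10, 100):
    print(n, aggregation_based(0.0, [1.0] * n))
```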
The reward-based semantics from [3] is based on the idea of founded argu-
ments. An argument a is called founded if there exists a sequence of argu-
ments (a0 , . . . , an ) such that an = a, (ai−1 , ai ) ∈ Sup for i = 1, . . . , n and
β(a0 ) > 0. That is, a has non-zero base score or is supported by a sequence
of supporters such that the first argument in the sequence has a non-zero
base score. Intuitively, this implies that a must have non-zero strength. We let
Sup⁺ = {(a, b) ∈ Sup | a is founded} denote the founded supports. For every
a ∈ A, we let N(a) = |{b | (b, a) ∈ Sup⁺}| denote the number of founded supporters of a and
M(a) = (Σ_{(b,a)∈Sup⁺} σ(b)) / N(a) the mean strength of the founded supporters. Then the
strength of a is defined as

σ(a) = β(a) + (1 − β(a)) · (Σ_{i=1}^{N(a)−1} 1/2^i + M(a)/2^{N(a)}).    (6)

The strength values can again be computed iteratively [6]. As we show next, the
reward-based semantics also satisfies Open-Mindedness.

Proposition 6. The reward-based semantics is open-minded.

Proof. In the subclass of support-only QBAFs, it suffices to check the second
condition of Definition 3. Let us first note that Σ_{i=1}^{N(a)−1} 1/2^i is a geometric sum
without the first term and therefore evaluates to

(1 − 1/2^{N(a)}) / (1 − 1/2) − 1 = 1 − 1/2^{N(a)−1}.

Note that this term already goes to 1 as the number of founded supporters
increases. We additionally add the non-negative term M(a)/2^{N(a)} = (Σ_{(b,a)∈Sup⁺} σ(b)) / (N(a) · 2^{N(a)}),
which is bounded from above by 1/2^{N(a)}. Therefore, the factor Σ_{i=1}^{N(a)−1} 1/2^i + M(a)/2^{N(a)}
is always between 0 and 1 and approaches 1 as N(a) → ∞.
To complete the proof, consider any support-only QBAF (A, ∅, Sup, β), any
argument a ∈ A, any ε > 0 and let (A ∪ {a1, . . . , aN}, Att, Sup ∪ {(ai, a) | 1 ≤ i ≤
N}, β′) be the QBAF defined in Definition 3 for some N ∈ N. Note that every
argument in {a1, . . . , aN} is a founded supporter of a. Therefore, N(a) ≥ N and
σ(a) → β(a) + (1 − β(a)) = 1 as N → ∞. This implies that there exists an
N0 ∈ N such that |σ(a) − 1| < ε for all N ≥ N0.
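A quick numeric check of the factor in Eq. (6) (the helper is illustrative): with N founded supporters of maximal strength 1 we have M(a) = 1, and the factor indeed approaches 1.

```python
# Factor of Eq. (6): geometric sum over the first N(a)-1 terms plus the
# mean-strength term M(a)/2^N(a).
def reward_factor(n_founded, mean_strength):
    return sum(1.0 / 2**i for i in range(1, n_founded)) + mean_strength / 2**n_founded

for n in (1, 5, 20):
    print(n, reward_factor(n, 1.0))  # 0.5, 0.96875, then very close to 1
```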

3.3 Bipolar QBAFs Without Base Score over D = [−1, 1]


In this section, we consider two semantics for bipolar QBAFs without base score
over D = [−1, 1] that have been introduced in [7]. It has not been explained how
the strength values are computed in [7]. However, given an acyclic graph, the
strength values can again be computed in topological order because the strength
of every argument depends only on the strength of its parents. For cyclic graphs,
one may consider an iterative procedure as before, but convergence may be an
issue. In our investigation, we will just assume that the strength values are well-
defined.
Following [8], we call the first semantics from [7], the loc-max semantics. It
defines strength values by the formula
σ(a) = (max_{(b,a)∈Sup} σ(b) − max_{(b,a)∈Att} σ(b)) / 2    (7)
By convention, the maximum now yields −1 if there are no supporters/attackers
(this is consistent with the previous conventions in that −1 is now the minimum
of the domain, whereas the minimum was 0 before). If a has neither attackers
nor supporters, then σ(a) = (−1 − (−1))/2 = 0. As we keep adding supporters
(attackers), the first (second) term in the numerator will take the maximum
strength value. From this we can see that the loc-max semantics is open-minded
for attack-only QBAFs without base score and for support-only QBAFs with-
out base score. However, it is not open-minded for bipolar QBAFs without base
score. For example, suppose that a has a single supporter b′, which has a single
supporter b′′ and no attackers. Further assume that b′′ has neither attackers nor
supporters, so that σ(b′′) = 0, σ(b′) = (0 − (−1))/2 = 1/2 and
σ(a) ≥ (1/2 − max_{(b,a)∈Att} σ(b))/2.
Since the maximum of the attackers can never become larger than 1, we have
σ(a) ≥ (1/2 − 1)/2 = −1/4, no matter how many attackers we add. Thus, the first condition of Open-Mindedness is violated. Using a symmetrical example, we can show
that the second condition can be violated as well.
Proposition 7. The loc-max semantics is not open-minded. It is open-minded
when restricting to attack-only QBAFs without base score or to support-only
QBAFs without base score.
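The counterexample can be sketched in a few lines (encoding ours; the empty max yields −1 by the convention above):

```python
# Sketch of the counterexample under the loc-max semantics, Eq. (7).
def loc_max(supporters, attackers):
    return (max(supporters, default=-1.0) - max(attackers, default=-1.0)) / 2.0

s_b2 = loc_max([], [])       # b'' has no parents: strength 0
s_b1 = loc_max([s_b2], [])   # b' supported by b'': strength 1/2
# However many strength-1 attackers are added to a, its strength stays at
# (1/2 - 1)/2 = -1/4, far from min(D) = -1.
print(loc_max([s_b1], [1.0] * 1000))  # -0.25
```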
Following [8], we call the second semantics from [7], the loc-sum semantics.
It defines strength values by the formula
σ(a) = 1 / (1 + Σ_{(b,a)∈Att} (σ(b) + 1)/2) − 1 / (1 + Σ_{(b,a)∈Sup} (σ(b) + 1)/2)    (8)

Note that if there are neither attackers nor supporters, then both fractions are
1 such that their difference is just 0. As we keep adding attackers (supporters),
the first (second) fraction goes to 0. It follows again that the loc-sum semantics
is open-minded for attack-only QBAFs without base score and for support-only
QBAFs without base score. However, it is again not open-minded for bipolar
QBAFs without base score. For example, if a has a single supporter b′ that has
neither attackers nor supporters, then σ(b′) = 0 and the second fraction evaluates
to 1/(1 + 1/2) = 2/3. As we keep adding attackers, the first fraction will go to 0, so that the
strength of a will converge to −2/3 rather than to −1 as the first condition of
Open-Mindedness demands. It is again easy to construct a symmetrical example
to show that the second condition of Open-Mindedness can be violated as well.
Proposition 8. The loc-sum semantics is not open-minded. It is open-minded
when restricting to attack-only QBAFs without base score or to support-only
QBAFs without base score.
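This counterexample, too, fits in a short sketch (encoding ours):

```python
# Sketch of the counterexample under the loc-sum semantics, Eq. (8): a has a
# single supporter of strength 0; adding attackers drives the first fraction
# to 0, so sigma(a) converges to -2/3 rather than to -1.
def loc_sum(supporters, attackers):
    f_att = 1.0 / (1.0 + sum((s + 1.0) / 2.0 for s in attackers))
    f_sup = 1.0 / (1.0 + sum((s + 1.0) / 2.0 for s in supporters))
    return f_att - f_sup

print(loc_sum([0.0], [1.0] * 1000))  # close to -2/3
```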

4 General QBAFs and Open-Mindedness

We now consider the general form of QBAFs as introduced in [8]. The domain
D = (S, ⪯) is now an arbitrary set along with a preorder ⪯, that is, a reflexive
and transitive relation over S. We further assume that there is an infimum inf(S)
and a supremum sup(S) that may or may not be contained in S. For example,
the open interval (0, ∞) contains neither its infimum 0 nor its supremum ∞,
whereas the half-open interval [0, ∞) contains its infimum, but not its supremum.

Definition 4 (QBAF). A QBAF over D = (S, ⪯) is a quadruple


(A, Att, Sup, β) consisting of a set of arguments A, a binary attack relation Att,
a binary support relation Sup and a function β : A → D that assigns a base
score β(a) to every argument a ∈ A.

We now define a generalized form of Open-Mindedness for general QBAFs. We


have to take account of the fact that there may no longer exist a minimum
or maximum of the set. So instead we ask that strength values can be made
smaller/larger than every element from S \{inf(S), sup(S)} by adding a sufficient
number of attackers/supporters. Intuitively, we want to add strong supporters.
In Definition 3, we just assumed that the maximum corresponds to the strongest
value, but there are semantics that regard smaller values as stronger and, again,
S may neither contain a maximal nor a minimal element. Therefore, we will just
demand that there is some base score s∗ , such that adding attackers/supporters
with base score s∗ has the desired consequence.
Definition 5 (Open-Mindedness (General Form)). Consider a semantics
that defines an interpretation σ : A → D for every QBAF from a particular
class F of QBAFs over D = (S, ). We call the semantics open-minded if for
every QBAF (A, Att, Sup, β) in F, for every argument a ∈ A and for every
s ∈ S \ {inf(S), sup(S)}, the following condition is satisfied: there is an N ∈ N
and an s∗ ∈ S such that when adding N new arguments AN = {a1 , . . . , aN },
A ∩ AN = ∅, then
1. if F allows attacks, then for (A ∪ AN, Att ∪ {(ai, a) | 1 ≤ i ≤ N}, Sup, β′), we have σ(a) ⪯ s, and
2. if F allows supports, then for (A ∪ AN, Att, Sup ∪ {(ai, a) | 1 ≤ i ≤ N}, β′), we have s ⪯ σ(a),
where β′(b) = β(b) for all b ∈ A and β′(ai) = s∗ for i = 1, . . . , N.
Note that if S is a compact real interval, s ∈ S \ {inf(S), sup(S)} can be chosen
arbitrarily close to sup(S) = max(S) or inf(S) = min(S), so that s in Defini-
tion 5 plays the role of  in Definition 3. In particular, if Definition 3 is satisfied,
Definition 5 can be satisfied as well for an arbitrary s by choosing base score
s∗ = max(S) and choosing N with respect to ε = (max(S) − s)/2 or ε = (s − min(S))/2.
Definitions 5 and 3 are actually equivalent for compact real intervals provided
that max(S) corresponds to the strongest initialization of the base score under
the given semantics, which is indeed the case in all previous examples.
As an example, for more general QBAFs, let us now consider the α-burden-
semantics from [5]. It defines strength values for attack-only QBAFs without
base score over the half-open interval [1, ∞). As opposed to our previous exam-
ples, the minimum 1 now corresponds to the strongest value and increasing values
correspond to less plausibility. The α-burden-semantics defines strength values
via the formula

σ(a) = 1 + (Σ_{(b,a)∈Att} 1/(σ(b))^α)^{1/α}.    (9)

α is called the burden-parameter and can be used to modify the semantics, see [5]
for more details about the influence of α. For α ∈ [1, ∞) ∪ {∞}, (9) is equivalent
to arranging the reciprocals of strength values of all attackers in a vector v and
to take the p-norm ‖v‖_p = (Σ_i v_i^p)^{1/p} of this vector with respect to p = α

and adding 1. Popular examples of p-norms are the Manhattan-, Euclidean- and
Maximum-norm that are obtained for p = 1, p = 2 and the limit-case p = ∞,
respectively. An unattacked argument has just strength 1 under the α-burden-
semantics. Hence, when adding N new attackers to a, we have σ(a) ≥ 1 + N^{1/α} for
α ∈ [1, ∞). Hence, the α-burden-semantics is clearly open-minded in this case,
even though it becomes more conservative as α increases. However, for the
limit case α = ∞, it is not open-minded. This can be seen from the observation
that the second term in (9) now corresponds to the maximum norm. Since the
strength of each attacker is in [1, ∞), their reciprocals are in (0, 1]. Therefore,
σ(a) ≤ 2 independent of the number of attackers of a.
Proposition 9. The α-burden-semantics is open-minded for α ∈ [1, ∞), but is
not open-minded for α = ∞.
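The contrast between finite α and the limit case can be sketched as follows (the helper is illustrative):

```python
# Sketch of the alpha-burden strength, Eq. (9), for one argument attacked by
# arguments of the given (fixed) strengths; alpha = inf uses the max-norm.
def alpha_burden(attackers, alpha):
    recip = [1.0 / s for s in attackers]
    if alpha == float("inf"):
        return 1.0 + max(recip, default=0.0)
    return 1.0 + sum(r**alpha for r in recip) ** (1.0 / alpha)

for n in (1, 100, 10000):
    print(n, alpha_burden([1.0] * n, 2.0), alpha_burden([1.0] * n, float("inf")))
# alpha = 2 yields 1 + sqrt(n), which grows without bound;
# alpha = inf stays at 2.0 no matter how many attackers there are.
```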

5 Related Work
Gradual argumentation has become a very active research area and found appli-
cations in areas like information retrieval [24], decision support [9,22] and social
media analysis [1,12,16]. Our selection of semantics followed the selection in [8].
One difference is that we did not consider social abstract argumentation [16]
here. The reason is that social abstract argumentation has been formulated in a
very abstract form, which makes it difficult to formulate interesting conditions
under which Open-Mindedness is guaranteed. Instead, we added the α-burden-
semantics from [5] because it gives a nice example for a more general semantics
that neither uses strength values from a compact interval nor regards larger
values as stronger.
The authors in [8] also view ranking-based semantics [11] as gradual argu-
mentation frameworks. In their most general form, ranking-based semantics just
order arguments qualitatively, so that our notion of Open-Mindedness is not very
meaningful. A variant may be interesting, however, that demands, that in every
argumentation graph, every argument can become first or last in the order if
only a sufficient number of supporters or attackers is added to this argument.
However, in many cases, this notion of Open-Mindedness may be entailed by
other properties already. For example, Cardinality Precedence [11] states that if
argument a1 has more attackers than a2 , then a1 must be weaker than a2 . In
finite argumentation graphs, this already implies that a1 will be last in the order
if we add a sufficient number of attackers.
There are other quantitative argumentation frameworks like probabilis-
tic argumentation frameworks [14,15,17,19,23]. In this area, Open-Mindedness
would simply state that the probability of an argument must go to 0 (1) as we
keep adding attackers (supporters). It may be interesting to perform a similar
analysis for probabilistic argumentation frameworks.
An operational definition of Open-Mindedness for the class of modular seman-
tics [18] for weighted bipolar argumentation frameworks has been given in [21].
The Df-QuAD semantics [22] and the Quadratic-energy Semantics [20] satisfy
this notion of open-mindedness [21]. However, in case of DF-QuAD and some

other semantics, this is actually counterintuitive because they cannot move the
strength of an argument towards 0 if there is a supporter with non-zero strength.
Indeed, DF-QuAD does not satisfy Open-Mindedness as defined here (every
QBAF with a non-zero strength supporter provides a counterexample). How-
ever, the quadratic energy model from [21] still satisfies the more restrictive
definition of Open-Mindedness that we considered here.
Another interesting property for bipolar QBAFs that is not captured by
Balance and Monotonicity is Duality [20]. Duality basically states that attack
and support should behave in a symmetrical manner. Roughly speaking, when
we convert an attack relation into a support relation or vice versa, the effect
of the relation should just be inverted. Duality is satisfied by the Df-QuAD
semantics [22] and the Quadratic-energy Semantics [20], but not by the Euler-
based semantics [4]. A formal analysis can be found in [20,21].

6 Conclusions

We investigated 9 gradual argumentation semantics from the literature. 5 of


them satisfy Open-Mindedness unconditionally. This includes the weighted h-
categorizer semantics and the weighted card-based semantics for attack-only
QBAFs from [6] and all three semantics for support-only QBAFs from [3]. The
α-burden-semantics for attack-only QBAFs without base score from [5] is open-
minded for α ∈ [1, ∞), but not for the limit case α = ∞. The loc-max seman-
tics and the loc-sum semantics for bipolar QBAFs without base score from [7]
are only open-minded when restricted to either attack-only or to support-only
QBAFs. Finally, the weighted max-based semantics for attack-only QBAFs from
[6] is not open-minded. However, as we saw, it can easily be adapted to satisfy
both Open-Mindedness and Quality Precedence.
In future work, it may be interesting to complement Open-Mindedness with
a Conservativeness property that demands that the original base scores are not
given up too easily. For the class of modular semantics [18] that iteratively com-
pute strength values by repeatedly aggregating strength values and combining
them with the base score, Conservativeness can actually be quantified analyti-
cally [21]. Intuitively, this can be done by analyzing the maximal local growth
of the aggregation and influence functions. There is actually an interesting rela-
tionship between Conservativeness and Well-Definedness of strength values. For
general QBAFs, procedures that compute strength values iteratively, can actu-
ally diverge [18] so that some strength values remain undefined. However, the
mechanics that make semantics more conservative, simultaneously improve con-
vergence guarantees [21]. In other words, convergence guarantees can often be
improved by giving up Open-Mindedness. The extreme case would be the naive
semantics that just assigns the base score as final strength to every argument
independent of the attackers and supporters. This semantics is clearly most con-
servative and always well-defined, but does not make much sense.
My personal impression is indeed that gradual argumentation semantics for
general QBAFs with strong convergence guarantees are too conservative at the

moment. Some well-defined semantics for general QBAFs have been presented
recently in [18], but they are not open-minded. I am indeed unaware of any
semantics for general QBAFs that is generally well-defined and open-minded.
It is actually possible to define for every k ∈ N, an open-minded semantics
that is well-defined for all QBAFs where arguments have at most k parents.
One example is the 1-max(k) semantics, see Corollary 3.5 in [21]. However,
as k grows, these semantics become more and more conservative even though
they remain open-minded. More precisely, every single argument can change the
strength value of another argument by at most 1/k, so that at least k arguments
are required to move the strength all the way from 0 to 1 and vice versa. A
better way to improve convergence guarantees may be to define strength values
not by discrete iterative procedures, but to replace them with continuous pro-
cedures that maintain the strength values in the limit, but improve convergence
guarantees [20,21]. However, while I find this approach promising, I admit that
it requires further analysis.
In conclusion, I think that Open-Mindedness is an interesting property that
is important for many applications. It is indeed satisfied by many semantics from
the literature. For others, like the weighted max-based semantics, we may be able
to adapt the definition. One interesting open question is whether we can define
semantics for general QBAFs that are generally well-defined and open-minded.

References
1. Alsinet, T., Argelich, J., Béjar, R., Fernández, C., Mateu, C., Planes, J.: Weighted
argumentation for analysis of discussions in Twitter. Int. J. Approximate Reason-
ing 85, 21–35 (2017)
2. Amgoud, L., Ben-Naim, J.: Axiomatic foundations of acceptability semantics. In:
International Conference on Principles of Knowledge Representation and Reason-
ing (KR), pp. 2–11 (2016)
3. Amgoud, L., Ben-Naim, J.: Evaluation of arguments from support relations:
axioms and semantics. In: International Joint Conferences on Artificial Intelligence
(IJCAI), p. 900 (2016)
4. Amgoud, L., Ben-Naim, J.: Evaluation of arguments in weighted bipolar graphs.
In: Antonucci, A., Cholvy, L., Papini, O. (eds.) ECSQARU 2017. LNCS (LNAI),
vol. 10369, pp. 25–35. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-
61581-3 3
5. Amgoud, L., Ben-Naim, J., Doder, D., Vesic, S.: Ranking arguments with
compensation-based semantics. In: International Conference on Principles of
Knowledge Representation and Reasoning (KR) (2016)
6. Amgoud, L., Ben-Naim, J., Doder, D., Vesic, S.: Acceptability semantics for
weighted argumentation frameworks. In: IJCAI, vol. 2017, pp. 56–62 (2017)
7. Amgoud, L., Cayrol, C., Lagasquie-Schiex, M.C., Livet, P.: On bipolarity in argu-
mentation frameworks. Int. J. Intell. Syst. 23(10), 1062–1093 (2008)
8. Baroni, P., Rago, A., Toni, F.: How many properties do we need for gradual argu-
mentation? In: AAAI Conference on Artificial Intelligence (AAAI), pp. 1736–1743.
AAAI (2018)

9. Baroni, P., Romano, M., Toni, F., Aurisicchio, M., Bertanza, G.: An
argumentation-based approach for automatic evaluation of design debates. In:
Leite, J., Son, T.C., Torroni, P., van der Torre, L., Woltran, S. (eds.) CLIMA
2013. LNCS (LNAI), vol. 8143, pp. 340–356. Springer, Heidelberg (2013). https://
doi.org/10.1007/978-3-642-40624-9 21
10. Besnard, P., Hunter, A.: A logic-based theory of deductive arguments. Artif. Intell.
128(1–2), 203–235 (2001)
11. Bonzon, E., Delobelle, J., Konieczny, S., Maudet, N.: A comparative study of
ranking-based semantics for abstract argumentation. In: AAAI Conference on Arti-
ficial Intelligence (AAAI), pp. 914–920 (2016)
12. Cocarascu, O., Rago, A., Toni, F.: Extracting dialogical explanations for review
aggregations with argumentative dialogical agents. In: International Conference on
Autonomous Agents and MultiAgent Systems (AAMAS), pp. 1261–1269. Interna-
tional Foundation for Autonomous Agents and Multiagent Systems (2019)
13. Dung, P.M.: On the acceptability of arguments and its fundamental role in non-
monotonic reasoning, logic programming and n-person games. Artif. Intell. 77(2),
321–357 (1995)
14. Hunter, A., Polberg, S., Potyka, N.: Updating belief in arguments in epistemic
graphs. In: International Conference on Principles of Knowledge Representation
and Reasoning (KR), pp. 138–147 (2018)
15. Hunter, A., Thimm, M.: Probabilistic reasoning with abstract argumentation
frameworks. J. Artif. Intell. Res. 59, 565–611 (2017)
16. Leite, J., Martins, J.: Social abstract argumentation. In: International Joint Con-
ferences on Artificial Intelligence (IJCAI), vol. 11, pp. 2287–2292 (2011)
17. Li, H., Oren, N., Norman, T.J.: Probabilistic argumentation frameworks. In: Mod-
gil, S., Oren, N., Toni, F. (eds.) TAFA 2011. LNCS (LNAI), vol. 7132, pp. 1–16.
Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29184-5 1
18. Mossakowski, T., Neuhaus, F.: Modular semantics and characteristics for bipolar
weighted argumentation graphs. arXiv preprint arXiv:1807.06685 (2018)
19. Polberg, S., Doder, D.: Probabilistic abstract dialectical frameworks. In: Fermé,
E., Leite, J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 591–599. Springer,
Cham (2014). https://doi.org/10.1007/978-3-319-11558-0 42
20. Potyka, N.: Continuous dynamical systems for weighted bipolar argumentation. In:
International Conference on Principles of Knowledge Representation and Reason-
ing (KR), pp. 148–157 (2018)
21. Potyka, N.: Extending modular semantics for bipolar weighted argumentation.
In: International Conference on Autonomous Agents and MultiAgent Systems
(AAMAS), pp. 1722–1730. International Foundation for Autonomous Agents and
Multiagent Systems (2019)
22. Rago, A., Toni, F., Aurisicchio, M., Baroni, P.: Discontinuity-free decision support
with quantitative argumentation debates. In: International Conference on Princi-
ples of Knowledge Representation and Reasoning (KR), pp. 63–73 (2016)
23. Rienstra, T., Thimm, M., Liao, B., van der Torre, L.: Probabilistic abstract argu-
mentation based on SCC decomposability. In: International Conference on Princi-
ples of Knowledge Representation and Reasoning (KR), pp. 168–177 (2018)
24. Thiel, M., Ludwig, P., Mossakowski, T., Neuhaus, F., Nürnberger, A.: Web-
retrieval supported argument space exploration. In: ACM SIGIR Conference on
Human Information Interaction and Retrieval (CHIIR), pp. 309–312. ACM (2017)
Approximate Querying on Property Graphs

Stefania Dumbrava¹(B), Angela Bonifati², Amaia Nazabal Ruiz Diaz², and Romain Vuillemot³

¹ ENSIIE Évry & CNRS Samovar, Évry, France
[email protected]
² University of Lyon 1 & CNRS LIRIS, Lyon, France
{angela.bonifati,amaia.nazabal-ruiz-diaz}@univ-lyon1.fr
³ École Centrale Lyon & CNRS LIRIS, Lyon, France
[email protected]
Abstract. Property graphs are becoming widespread for modeling data with complex structural characteristics, enhancing nodes and edges with lists of properties. In this paper, we focus on the approximate evaluation of counting queries involving recursive paths on property graphs. As such queries are already difficult to evaluate over pure RDF graphs, they require an ad-hoc graph summary for their approximate evaluation on property graphs. We prove the intractability of the optimal graph summarization problem under our algorithm's conditions. We design and implement a novel property graph summary suitable for the above queries, along with an approximate query evaluation module. Finally, we show the compactness of the obtained summaries, as well as the accuracy of answering counting recursive queries on them.

1 Introduction
A tremendous amount of information stored in the Linked Open Data (LOD) cloud can be inspected by leveraging the already mature query capabilities of SPARQL, relational, and graph databases [14]. However, arbitrarily complex queries [2,3,7], entailing rather intricate, possibly recursive, graph patterns, prove difficult to evaluate, even on small-sized graph datasets [4,5]. On the other hand, the usage of these queries has radically increased in real-world query logs, as shown by recent empirical studies on SPARQL queries from large-scale Wikidata and DBpedia corpuses [8,17]. As a tangible example of this growth, the percentage of SPARQL property paths has increased from 15% to 40% between 2017 and the beginning of 2018 [17], for user-specified Wikidata queries. In this paper, we focus on regular path queries (RPQs), which identify paths labeled with regular expressions, and we aim to offer an approximate query evaluation solution. In particular, we consider counting queries with regular paths, which are a notable fragment of graph analytical queries. The exact evaluation of counting queries on graphs is #P-complete [21], a result based on the enumeration of simple graph paths.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 250–265, 2019.
https://doi.org/10.1007/978-3-030-35514-2_19

Due to this intractability, an efficient and highly accurate approximation of these queries is desirable, which we address in this paper.

Approximate query processing on relational data and the related sampling methods are not applicable to graphs, since the adopted techniques are based on the linearity assumption [15], i.e., the existence of a linear relationship between the sample size and execution time, typical of relational query processing. As such, we design a novel query-driven graph summarization approach tailored for property graphs. These significantly differ from the RDF and relational data models, as they attach data values to property lists on both nodes and edges [7].

To the best of our knowledge, ours is the first work on approximate property graph analytics addressing counting estimation on top of navigational graph queries. We illustrate our query fragment with the running example below.

Example 1 (Social Network Advertising). Let GSN (see Fig. 1) be a property graph (see Sect. 2) encoding a social network, whose schema is inspired by the LDBC benchmark [12]¹. Entities are people (type Person, Pi) that know (l0) and/or follow (l1) either each other or certain forums (type Forum, Fi). These are moderated (l2) by specific persons and can contain (l3) messages/ads (type Message, Mi), to which persons can author (l4) other messages in reply (l5).

We focus on an RPQ [3,23] dialect with counting, capturing the following query types (Q1–Q7) (see Fig. 2): (1) Simple/Optional Label. The number of pairs satisfying Q1, i.e., () −l5→ (), counts the ad reactions, while that for Q2, i.e., () −l2?→ (), indicates the number of potential moderators. (2) Kleene Plus/Kleene Star. The number of the connected/potentially connected acquaintances is the count of node pairs satisfying Q3, i.e., () ← l0+ (), respectively Q4, i.e., () ← l0* (). (3) Disjunction. The number of the targeted subscribers is the sum of counting all node pairs satisfying Q5, i.e., () ←l4− () or () ←l1− (). (4) Conjunction. The direct reach of a company via its page ads is the count of node pairs satisfying Q6, i.e., () ←l4− () −l5→ (). (5) Conjunction with Property Filters. Recommendation systems can further refine the Q6 estimates. Thus, one can compute the direct demographic reach and target people within an age group, e.g., 18–24, by counting all node pairs that satisfy Q7, i.e., (x) ←l4− () −l5→ (), s.t. x.age ≥ 18 and x.age ≤ 24.

Contributions. Our paper provides the following main contributions:

– We design a property graph summarization algorithm for approximately evaluating counting regular path queries (Sect. 3).
– We prove the intractability of the optimal graph summarization problem under the conditions of our summarization algorithm (Sect. 3).
– We define a query translation module, ensuring that queries on the initial and summary property graphs are expressible in the same fragment (Sect. 4).
– Based on this, we experimentally exhibit the small relative errors of various workloads, in the expressive query fragment from Example 1. We measure the relative response time between estimating counting recursive queries on summaries and on the original graphs. For non-recursive queries, we compare with SumRDF [19], a baseline graph summary for RDF datasets (Sect. 5).

In Sect. 2, we revisit the property graph model and query language. We present related work in Sect. 6 and conclude the paper in Sect. 7.

¹ One of the few benchmarks currently available for generating property graphs.

[Fig. 1 shows the example social graph GSN: persons P1–P10 connected by knows (l0) and follows (l1) edges; forums F1–F2, moderated (l2) by persons and containing (l3) messages M1–M6, which persons author (l4); replies R1–R7 related by replies (l5) and reshares (l6) edges. Graphic omitted.]

Fig. 1. Example social graph GSN

Q1 (l5):      Ans(count()) ← l5(_, _)
Q2 (l2):      Ans(count()) ← l2?(_, _)
Q3 (l0):      Ans(count()) ← l0+(_, _)
Q4 (l0):      Ans(count()) ← l0*(_, _)
Q5 (l4, l1):  Ans(count()) ← l4 + l1(_, _)
Q6 (l4, l5):  Ans(count()) ← l4− · l5(_, _)
Q7 (l4, l5):  Ans(count(x)) ← l4− · l5(x, _), ≥(x.age, 18), ≤(x.age, 24)

Fig. 2. Targeted advertising queries

2 Preliminaries

Graph Model. We take the property graph model (PGM) [7] as our foundation. Graph instances are multi-edge digraphs: objects are represented by typed data vertices, and their relationships by typed, labeled edges. Vertices and edges can have any number of properties (key/value pairs). Let LV and LE be disjoint sets of vertex (edge) labels, and G = (V, E), with E ⊆ V × LE × V, a graph instance. Vertices v ∈ V have an id label, lv, and a set of property labels (attributes, li), each with a (potentially undefined) term value. For e ∈ E, we use the binary notation e = le(v1, v2) and abbreviate v1 as e.1 and v2 as e.2. We denote the number of occurrences of le as #le, and the set of all edge labels in G as Λ(G). Other key notations used henceforth are given in Table 1.
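A minimal in-memory rendering of this model can make the definitions concrete; the class and method names below are hypothetical, not part of the paper's implementation:

```python
from collections import defaultdict

class PropertyGraph:
    """Multi-edge digraph with property lists on vertices (PGM sketch)."""

    def __init__(self):
        self.vertex_props = {}               # id label lv -> {attribute li: value}
        self.edges = []                      # triples (v1, le, v2), i.e., E subset of V x LE x V
        self.label_count = defaultdict(int)  # le -> #le, its number of occurrences

    def add_vertex(self, lv, **props):
        self.vertex_props[lv] = props

    def add_edge(self, v1, le, v2):
        self.edges.append((v1, le, v2))
        self.label_count[le] += 1

    def labels(self):
        """Lambda(G): the set of all edge labels in G."""
        return set(self.label_count)
```

For instance, the GSN edge knows(P1, P2) becomes `g.add_edge("P1", "knows", "P2")`, and `g.label_count["knows"]` yields #l0.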

Clauses  C ::= A ← A1, . . . , An | Q ← A1, . . . , An
Queries  Q ::= Ans(count(_)) | Ans(count(lv)) | Ans(count(lv1, lv2))
Atoms    A ::= π(lv1, lv2) | op(lv1.li, lv2.lj) | op(lv1.li, k),  op ∈ {<, ≤, >, ≥}, k ∈ ℝ
Paths    π ::= ε | le | le? | le− | le* | le1 · le2 | π + π

Fig. 3. Graph query language



Graph Query Language. To query the above property graph model, we rely
on an RPQ [10,11] fragment with aggregate operators (see Fig. 3). RPQs cor-
respond to SPARQL 1.1 property paths and are a well-studied query class
tailored to express graph patterns of one or more label-constrained reachabil-
ity paths. For labels lei and vertices vi , the labeled path π, corresponding to
v1 − → le1 v2 . . . vk−1 −
→ lek vk , is the concatenation le1 · . . . · lek . In their full gen-
erality, RPQs allow one to select vertices connected via such labeled paths in
a regular language over LE . We restrict RPQs to handle atomic paths – bi-
directional, optional, single-labeled (le , le ?, and le− ) and transitive single-labeled
(le∗ ) – and composite paths – conjunctive and disjunctive composition of atomic
paths (le · le and π + π). While not as general as SPARQL, our fragment already
captures more than 60% of the property paths found in practice in SPARQL
query logs [8]. Moreover, it captures property path queries, as found in the large
Wikidata corpus studied in [9]. Indeed, almost all the property paths in the con-
sidered logs contain Kleene-star expressions over single labels. In our work, we
enrich the above query classes with the count operator and support basic graph
reachability estimates.

3 Graph Summarization

We introduce a novel algorithm that summarizes any property graph into one
tailored for approximately counting reachability queries. The key idea is that,
as nodes and edges are compressed, informative properties are iteratively added
to the corresponding newly formed structures, to enable accurate estimations.
The grouping phase (Sect. 3.1) computes Φ, a label-driven G-partitioning
into subgroupings, following the connectivity on the most frequent labels in G. A
first summarization collapses the vertices and inner-edges of each subgrouping
into s-nodes and the edges connecting s-nodes, into s-edges. The merge phase
(Sect. 3.2), based on further label-reachability conditions, specified by a heuristic
mode m, collapses s-nodes into h-nodes and s-edges into h-edges.

Table 1. Notation table

G, Φ, v, V, e, E      graph, graph partitioning, vertex (set), edge (set)
G*, v*, V*, e*, E*    s-graph, s-node (set), s-edge (set)
Ĝ, v̂, V̂, ê, Ê         h-graph, h-node (set), h-edge (set)
λ(G)                  label on which a graph G is maximally l-connected
Λd(v*), d ∈ {1, 2}    set of edge labels with direction d w.r.t. v* (1: incoming, 2: outgoing)

3.1 Grouping Phase


For each frequently occurring label l in G, in descending order, we iteratively partition G into Φ, containing components that are connected on l, as below.

Definition 1 (Maximal L-Connectivity). A G-subgraph², G′ = (V′, E′), is maximally l-connected, i.e., λ(G′) = l, iff (1) G′ is weakly-connected, (2) removing any l-labeled edge from E′, there exists a V′ node pair not connected by an l+-labeled undirected path, and (3) no l-labeled edge connects a V′ node to V \ V′.

Example 2. In Fig. 1, G1 is maximally l0-connected, since it is weakly-connected, not connected by an l0-labeled edge to the rest of G, and such that, by removing P8 −l0→ P9, no undirected l0+-labeled path unites P8 and P9.
We call each such component a subgrouping. The procedure (see Algorithm 1) computes, as the first grouping, all the subgroupings for the most frequent label, l1, and then identifies those corresponding to the rest of the graph and to l2. At the end, all remaining nodes are collected into a final subgrouping. We illustrate this in Fig. 4, on the running example below.

Example 3 (Grouping). In Fig. 1, #l0 = 11, #l1 = 3, #l2 = 2, #l3 = 6, #l4 = #l5 = 7, #l6 = 1, and the frequency-ordered label list is Λ⃗(G) = [l0, l5, l4, l3, l1, l2, l6], as #l4 = #l5 allows arbitrary ordering. We add the maximal l0-connected subgraph, G1, to Φ. Hence, V = {Ri∈1,7, Mi∈1,6, F1, F2}. Next, we add G2, regrouping the maximal l5-connected subgraph. Hence, V = {F1, F2}; we add G3 and output Φ = {G1, G2, G3}.

Algorithm 1. GROUPING(G)
Input: G – a graph; Output: Φ – a graph partitioning

1: n ← |Λ(G)|, Λ⃗(G) ← [l1, . . . , ln], Φ ← ∅, i ← 1      ▷ descending-frequency label list Λ⃗(G)
2: for all li ∈ Λ⃗(G) do                                  ▷ label-driven partitioning computation
3:   Φ ← Φ ∪ {G*k = (V*k, E*k) ⊆ G | λ(G*k) = li}         ▷ maximally li-connected subgraphs
4:   V ← V \ {v ∈ V*k | k ∈ ℕ}                            ▷ discard already considered nodes
5:   i ← i + 1
6: Φ ← Φ ∪ {Gi = (V*i, E*i) ⊆ G | V*i = V \ V*}           ▷ collect remains in final subgroup
7: return Φ
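The grouping phase can be sketched compactly in executable form, under the simplifying assumption that a maximally l-connected subgrouping is an undirected connected component over the l-labeled edges whose endpoints were not claimed by a more frequent label (all names hypothetical):

```python
from collections import defaultdict

def grouping(vertices, edges):
    """Algorithm 1 sketch: partition vertices by descending label frequency;
    each subgrouping is an undirected l-edge component; leftover vertices
    are collected into a final subgrouping tagged None."""
    freq = defaultdict(int)
    for _, l, _ in edges:
        freq[l] += 1
    remaining, phi = set(vertices), []
    for l in sorted(freq, key=freq.get, reverse=True):
        adj = defaultdict(set)              # undirected l-adjacency on remaining nodes
        for v1, lab, v2 in edges:
            if lab == l and v1 in remaining and v2 in remaining:
                adj[v1].add(v2)
                adj[v2].add(v1)
        seen = set()
        for start in list(adj):             # one DFS per l-component
            if start in seen:
                continue
            comp, stack = set(), [start]
            while stack:
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    seen.add(v)
                    stack.extend(adj[v] - comp)
            phi.append((l, comp))
        remaining -= seen                   # discard already considered nodes
    if remaining:                           # collect remains in final subgroup
        phi.append((None, remaining))
    return phi
```

On the toy edge set {knows(P1,P2), knows(P2,P3), authors(P1,M1)}, the most frequent label knows claims {P1, P2, P3}, and M1 falls into the final subgrouping, mirroring how Example 3 peels GSN label by label.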

A G-partitioning Φ (see Fig. 4a) is transformed into an s-graph G* = (V*, E*) (see Fig. 4b). As such, each s-node gathers all the nodes and inner edges of a Φ-subgrouping, G*j, and each s-edge, all same-labeled cross-edges (edges between pairwise distinct s-nodes). During this phase, we compute analytics concerning the regrouped entities. We leverage the PGM's expressivity to internalize these as properties, e.g., Fig. 5 (right)³. Hence, to every s-edge, e*, we attach EWeight,

² G′ is a G-subgraph iff V′ ⊆ V and E′ ⊆ E; it is weakly connected iff there exists an undirected path between any pair of its vertices.
³ All corresponding formulas are provided in the additional material.

[Fig. 4 illustrates the summarization phases on GSN: (a) the grouping phase partitions the graph into the subgroupings G1–G3; (b) the evaluation phase collapses them into s-nodes v1*–v9*, linked by labeled s-edges; (c) the merge phase produces h-nodes v̂i under the target-merge (left) and source-merge (right) heuristics. Graphic omitted.]

Fig. 4. Summarization phases for GSN

its number of compressed edges; e.g., in Fig. 4b, all s-edges have weight 1, except e*(v4*, v1*), with weight 2. To every s-node, v*, we attach properties concerning: (1) Compression. VWeight and EWeight store its number of inner vertices/edges. (2) Inner-Connectivity. The percentage of its l-labeled inner edges is LPercent, and the number of its vertex pairs connected with an l-labeled edge is LReach. These first two types of properties will be useful in Sect. 4, for estimating Kleene paths, as the labels of inner-edges in s-nodes are not unique, e.g., both l0 and l1 appear in v1*. (3) Outer-Connectivity. For pairs of labels and direction indices with respect to v* (d = 1, for incoming edges, and d = 2, for outgoing ones), we compute cross-connectivity, CReach, as the number of binary cross-edge paths that start/end in v*. Analogously, we record that of binary traversal paths, i.e., formed of an inner v* edge and of a cross-edge, as TReach. Also, for a label l and given direction, we store, as VF, the number of frontier vertices on l, i.e., that of v* nodes at either endpoint of an l-labeled s-edge.

We can thus record traversal connectivity information, LPart, dividing the number of traversal paths by that of the frontier vertices on the cross-edge label. Intuitively, this is because traversal connectivity, as opposed to cross connectivity, also needs to account for the "dispersion" of the inner-edge label of the path within the s-node it belongs to. For example, for a traversal path lc · li, formed of a cross-edge, lc, and an inner one, li, not all frontier nodes on lc are endpoints of li-labeled inner-edges, as we will see in the example below.

Example 4 (Outer-Connectivity). Figure 5 (left) depicts a stand-alone example, in which circles denote s-nodes, labeled arrows denote the s-edges relating them, and crosses represent nameless vertices (we only label relevant ones, for simplicity). We use this configuration to illustrate analytics regarding cross and traversal connectivity on labels l1 and l2. For instance, as we will see in Sect. 4, when counting l1 · l2− cross-edge paths, we look at the CReach s-node properties mentioning these labels and note that there is a single such one, i.e., that corresponding to l1 and l2 appearing on edges incoming v1*, i.e., CReach(v1*, l1, l2, 1, 1) = 1. When counting l1 · l2 traversal paths, for the case when l1 appears on the cross-edge, we look at the properties of s-nodes containing l2 inner-edges. Hence, for v2*, we note that there is a single such path, formed by an outgoing l2 edge and an incoming l1 edge, as TReach(v2*, l1, l2, 1, 1) = 1. To estimate the traversal connectivity, we divide this by the number of frontier vertices on incoming l1 edges. As VF(v2*, l1, 1) = {v2, v3}, we have that LPart(v2*, l1, l2, 1, 1) = 0.5.

VWeight:   v1* → 10, v*{2,3,5,6,7} → 2, v4* → 3, v*{8,9} → 1
EWeight:   v1* → 14, v*{2,3,5,6,7} → 1, v4* → 3, v*{8,9} → 0
LReach:    (v1*, l0) → 11, (v1*, l1) → 3
LPercent:  (v1*, l0) → 79, (v1*, l1) → 21

Fig. 5. Selected properties for Fig. 4b (right); frontier vertices (left; graphical part omitted)
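The LPart estimate in this example is just a ratio; the helper below is a hypothetical name for it:

```python
def lpart(treach, frontier_vertices):
    """Traversal-connectivity share: the number of binary traversal paths,
    TReach, normalized by the frontier vertices on the cross-edge label."""
    return treach / len(frontier_vertices)
```

With TReach(v2*, l1, l2, 1, 1) = 1 and VF(v2*, l1, 1) = {v2, v3}, this yields 0.5, as in the example.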

3.2 Merge Phase

We take as input the graph computed by Algorithm 1 and a label set, and output a compressed graph, Ĝ = (V̂, Ê). During this phase, sets of h-nodes, V̂, and h-edges, Ê, are created. At each step, as previously, Ĝ is enriched with approximation-relevant precomputed properties (see Sect. 4).

Each h-node, v̂, merges all s-nodes, vi*, vj* ∈ V*, that are maximally label-connected on the same label, i.e., λ(vi*) = λ(vj*), and that have either the same set of incoming (source-merge) or outgoing (target-merge) edge labels, i.e., Λd(vi*) = Λd(vj*), d ∈ {1, 2} (see Algorithm 2). Each h-edge, ê, merges all s-edges in E* with the same label and orientation, i.e., ei*.d = ej*.d, for d ∈ {1, 2}.

Algorithm 2. MERGE(V*, Λ, m)
Input: V* – s-nodes; Λ – labels; m – heuristic mode; Output: V̂ – h-nodes

1: for all v* ∈ V* do
2:   Λd(v*) ← {l ∈ Λ | ∃ e* = l(_, _) ∈ E* ∧ e*.d = v*}    ▷ labels incoming/outgoing v*
3: for all v1*, v2* ∈ V* do                                ▷ pair-wise s-node inspection
4:   bλ ← (λ(v1*) = λ(v2*)), bd ← (Λd(v1*) = Λd(v2*)), d ∈ {1, 2}    ▷ boolean conditions
5:   if m = true then v̂ ← {v1*, v2* | bλ ∧ b1}              ▷ target-merge
6:   else v̂ ← {v1*, v2* | bλ ∧ b2}                          ▷ source-merge
7: V̂ ← {v̂k | k ∈ [1, |V*|]}                                ▷ h-node computation
8: return V̂
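The merge criterion amounts to bucketing s-nodes by a shared key. The sketch below follows Algorithm 2's annotations, pairing m = true with equal incoming label sets (target-merge); the tuple encoding of s-nodes is an illustrative assumption:

```python
def merge(snodes, m):
    """Merge-phase sketch (Algorithm 2): s-nodes sharing their maximal
    connectivity label lam and the same incoming (m=True, target-merge)
    or outgoing (m=False, source-merge) label set collapse into one h-node.
    `snodes` maps an s-node id to (lam, in_labels, out_labels)."""
    buckets = {}
    for v, (lam, in_labels, out_labels) in snodes.items():
        key = (lam, frozenset(in_labels if m else out_labels))
        buckets.setdefault(key, set()).add(v)   # one h-node per shared key
    return list(buckets.values())
```

For example, two l0-connected s-nodes with identical incoming labels collapse under target-merge but may stay apart under source-merge if their outgoing labels differ.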

To each h-node, we attach properties whose values, except for LPercent, are the sums of those of its s-nodes. For the label percentage, these values record the weighted percentage mean. Next, we merge s-edges into h-edges, if they have the same label and endpoints, and attach to each h-edge its number of compressed s-edges, EWeight. We also record the average s-node weight, V*Weight, to estimate how many nodes an h-node compresses.

To formally characterize the graph transformation corresponding to our summarization technique, we first define the following function.

Definition 2 (Valid Summarization). For G = (V, E), a valid summarization function χΛ : V → ℕ assigns vertex identifiers, s.t. any vertices with the same identifier are either in the same maximally l-connected G-subgraph, or in different ones, not connected by an l-labeled edge.

A valid summary is thus obtained from G by collapsing vertices with the same χΛ value into h-nodes, and edges with the same (depending on the heuristic, incoming/outgoing) label into h-edges. We illustrate this below.

Example 5 (Graph Compression). The graphs in Fig. 4c are obtained from G* = (V*, E*) after the merge phase. Each h-node contains the s-nodes (see Fig. 4b) collapsed via the target-merge (left) and source-merge (right) heuristics.

We study our summarization's optimality, i.e., the size of the obtained compressed graph, in order to assess its tractability. Specifically, we investigate the following MinSummary problem, to establish whether one can always minimize the number of nodes of an input graph when constructing its valid summary.
Problem 1 (Minimal Summary). Let MinSummary be the problem that, for a
graph G and an integer k  ≥ 2, decides if there exists a label-driven partitioning
Φ of G, |Φ| ≤ k  , such that χΛ is a valid summarization.
Each MinSummary h-node is thus intended to regroup as many nodes from
the original graph as possible, while ensuring these are connected by frequently
occurring labels. This condition (see Definition 2) reflects the central idea of our
framework, namely that the connectivity of such prominent labels can serve to
both compress a graph and to approximately evaluate label-constrained reacha-
bility queries. Next, we establish the difficulty of solving MinSummary.

Theorem 1 (MinSummary NP-completeness). Even for undirected graphs, |Λ(G)| ≤ 2, and k′ = 2, MinSummary is NP-complete⁴.

The intractability of constructing an optimal summary thus justifies our search for heuristics with good performance in practice.

⁴ Proof given at: http://web4.ensiie.fr/~stefania.dumbrava/SUM19_appx.pdf.

4 Approximate Query Evaluation

Query Translation. For G and a counting reachability query Q, we approximate [[Q]]G, the evaluation of Q over G. We translate Q into a query QT, evaluated over the summarization Ĝ of G, s.t. [[QT]]Ĝ ≈ [[Q]]G. The translations by input query type are given in Fig. 6, with PGQL as concrete syntax. (1) Simple and Optional Label Queries. A label l occurs in Ĝ either within an h-node or on a cross-edge. Thus, we either cumulate the number of l-labeled h-node inner-edges or the l-labeled cross-edge weights. To account for the potential absence of l, we also estimate, in the optional-label queries, the number of nodes in Ĝ, by cumulating those in each h-node. (2) Kleene Plus and Kleene Star Queries. To estimate l+, we cumulate the counts within h-nodes containing l-labeled inner-edges and the weights on l-labeled cross-edges. For the former, we distinguish whether the l+ reachability is due to: (1) inner-connectivity – we use the property counting the inner l-paths; (2) incoming cross-edges – we cumulate the l-labeled in-degrees of h-nodes; or (3) outgoing cross-edges – we cumulate the number of outgoing l-paths. To handle the ε-label in l*, we also estimate the number of nodes in Ĝ. (3) Disjunction. We treat each possible configuration, on both labels. Hence, we either cumulate the number of h-node inner-edges or that of cross-edge weights, with either label. (4) Binary Conjunction. We distinguish whether the label pair appears on an inner h-node path, on a cross-edge path, or on a traversal one.

Example 6. We illustrate the approximate evaluation of these query types on Fig. 4. To evaluate the number of single-label atomic paths, e.g., QTL(l5): as l5 only occurs inside h-node v̂2, [[l5]]Ĝ is the amount of l5-labeled inner edges in v̂2, i.e., EWeight(v̂2, l5) ∗ LPercent(v̂2, l5) = 7. To estimate the number of optional-label atomic paths, e.g., QTO(l2), we add to QTL(l2) the total number of graph vertices, Σ_{v̂∈V̂} V*Weight(v̂) ∗ VWeight(v̂) (empty case). As l2 only appears on an h-edge of weight 2 and there are 25 initial vertices, [[l2?]]Ĝ is 27. To estimate Kleene-plus queries, e.g., QTP(l0): as no h-edge has label l0, we return LReach(v̂1, l0), i.e., the number of l0-connected vertex pairs. Thus, [[l0+]]Ĝ is 15. For Kleene-star, we add to this the previously computed total number of vertices and obtain that [[l0*]]Ĝ is 40. For disjunction queries, e.g., [[l4 + l1]]Ĝ, we cumulate the single-labeled atomic paths on each label, yielding 14. For binary conjunctions, e.g., [[l4− · l5]]Ĝ, we rely on the traversal connectivity, LPart(v*, l4, l5, 2, 2), as l4 appears on an h-edge and l5 inside h-nodes; we thus count 7 node pairs.
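The simple-label estimate from this example can be mirrored in a few lines; the dictionary shapes below are hypothetical stand-ins for the summary's node and edge properties:

```python
def estimate_simple_label(hnodes, hedges, l):
    """[[l]] estimate on a summary: l-labeled inner edges of h-nodes
    (EWeight * LPercent for l) plus the weights of l-labeled h-edges,
    mirroring translation QT_L in Fig. 6."""
    inner = sum(p["EWeight"] * p["LPercent"].get(l, 0.0)
                for p in hnodes.values())
    cross = sum(w for (_, lab, _), w in hedges.items() if lab == l)
    return inner + cross
```

With a single h-node carrying seven l5-labeled inner edges and no l5-labeled h-edge, the estimate is 7, matching [[l5]]Ĝ above.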

QL(l):    SELECT COUNT(*) MATCH () -[:l]-> ()
QTL(l):   SELECT SUM(x.LPERCENT_L * x.EWEIGHT) MATCH (x)
        + SELECT SUM(e.EWEIGHT) MATCH () -[e:l]-> ()
QO(l):    SELECT COUNT(*) MATCH () -[:l?]-> ()
QTO(l):   SELECT SUM(x.LPERCENT_L * x.EWEIGHT) MATCH (x)
        + SELECT SUM(e.EWEIGHT) MATCH () -[e:l]-> ()
        + SELECT SUM(x.AVG_SN_VWEIGHT * x.VWEIGHT) MATCH (x)
QP(l):    SELECT COUNT(*) MATCH () -/:l+/-> ()
QTP(l):   SELECT SUM(x.LREACH_L) MATCH (x) WHERE x.LREACH_L > 0
        + SELECT SUM(e.EWEIGHT) MATCH () -[e:l]-> ()
QS(l):    SELECT COUNT(*) MATCH () -/:l*/-> ()
QTS(l):   SELECT SUM(x.LREACH_L) MATCH (x) WHERE x.LREACH_L > 0
        + SELECT SUM(e.EWEIGHT) MATCH () -[e:l]-> ()
        + SELECT SUM(x.AVG_SN_VWEIGHT * x.VWEIGHT) MATCH (x)
QD(l1, l2):  SELECT COUNT(*) MATCH () -[:l1|l2]-> ()
QTD(l1, l2): SELECT SUM(x.LPERCENT_L1 * x.EWEIGHT + x.LPERCENT_L2 * x.EWEIGHT) MATCH (x)
           + SELECT SUM(e.EWEIGHT) MATCH () -[e:l1|l2]-> ()
QC(l1, l2, 1, 1): SELECT COUNT(*) MATCH () -[:l1]-> () <-[:l2]- ()
QC(l1, l2, 1, 2): SELECT COUNT(*) MATCH () -[:l1]-> () -[:l2]-> ()
QC(l1, l2, 2, 1): SELECT COUNT(*) MATCH () <-[:l1]- () <-[:l2]- ()
QC(l1, l2, 2, 2): SELECT COUNT(*) MATCH () <-[:l1]- () -[:l2]-> ()
QTC(l1, l2, d1, d2):
    SELECT SUM((x.LPART_L2_L1_D2_D1 * e.EWEIGHT)/(x.LPERCENT_L1 * x.VWEIGHT))
      MATCH (x) -[e:l2]-> () WHERE x.LPERCENT_L1 > 0
  + SELECT SUM((y.LPART_L1_L2_D1_D2 * e.EWEIGHT)/(y.LPERCENT_L2 * y.VWEIGHT))
      MATCH () -[e:l1]-> (y) WHERE y.LPERCENT_L2 > 0
  + SELECT SUM(x.CREACH_L1_L2_D1_D2) MATCH (x)
  + SELECT SUM(x.EWEIGHT * min(x.LPERCENT_L1, x.LPERCENT_L2)) MATCH (x)

Fig. 6. Query translations onto the graph summary.

5 Experimental Analysis

In this section, we present an empirical evaluation of our graph summarization, recording (1) the succinctness of our summaries and the efficiency of the underlying algorithm and (2) the suitability of our summaries for the approximate evaluation of counting label-constrained reachability queries.

Setup, Datasets, and Implementation. The summarization and approximation modules are implemented in Java, using OpenJDK 1.8⁵. As the underlying graph database backend, we have used Oracle Labs PGX 3.1, which is the only property graph engine allowing for the evaluation of complex RPQs.

To implement the intermediate graph analysis operations (e.g., weakly connected components), we used the Green-Marl domain-specific language and modified the methods to fit the construction of node properties required by our summarization algorithm. We base our analysis on the graph datasets in Fig. 7, encoding: a bibliographic network (bib), the LDBC social network schema [12] (social), Uniprot knowledge graphs (uniprot), and the WatDiv schema [1] (shop). We obtained these datasets using gMark [5], a synthetic graph instance and query workload generator. As gMark tries to construct the instance that best fits

⁵ Available at: https://github.com/grasp-algorithm/label-driven-summarization.

Dataset  |LV| |LE|   ~1K |V|/|E|    ~5K |V|/|E|     ~25K |V|/|E|     ~50K |V|/|E|     ~100K |V|/|E|    ~200K |V|/|E|
bib        5    4    916/1304      4565/6140       22780/3159       44658/60300      88879/119575     179356/240052
social    15   27    897/2127      4434/10896      22252/55760      44390/110665     88715/223376     177301/450087
uniprot    5    7    2170/3898     6837/18899      25800/97059      47874/192574     91600/386810     177799/773082
shop      24   82    3136/4318     6605/10811      17893/34052      31181/56443      57131/93780      109205/168934

Fig. 7. Datasets: no. of vertices |V|, edges |E|, vertex |LV| and edge labels |LE|.

the size parameter and schema constraints, the resulting sizes vary (especially for
the very dense graphs social and shop). Next, on the same datasets, we generated
workloads of varying sizes, for each type in Sect. 2. These datasets and related
query workloads have been chosen since they provide the most recent benchmarks
for recursive graph queries and also to ensure a comparison with SumRDF [19]
(as shown next) on a subset of those supported by the latter. Studies [8,17] have
shown that practical graph pattern queries formulated by users in online query
endpoints are often small: 56.5% of real-life SPARQL queries consist of a single
edge (RDF triple), whereas 90.8% use 6 edges at most. Hence, we select small-
sized template queries with frequently occurring topologies, such as chains [8],
and formulate them on our datasets, for workloads of ∼600 queries.
Experiments ran on a cloud VM with Intel Xeon E312xx, 4 cores, 1.80 GHz
CPU, 128 GB RAM, and Ubuntu 16.04.4 64-bit. Each data point corresponds to
repeating an experiment 6 times, removing the first value from the average.
Summary Compression Ratios. First, we evaluate the effect that using the source-merge and target-merge heuristics has on the summary construction time (SCT). We also assess the compression ratio (CR) on the original graph's vertices and edges, by measuring (1 − |V̂|/|V|) ∗ 100 and, respectively, (1 − |Ê|/|E|) ∗ 100.
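These ratios are straightforward to compute (hypothetical helper):

```python
def compression_ratio(original, summary):
    """CR in percent: (1 - |summary| / |original|) * 100, as defined above."""
    return (1 - summary / original) * 100
```

For instance, a graph whose vertex set shrinks to a quarter of its original size has a vertex CR of 75%.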
Next, we compare the results for source and target merge. In Fig. 8(a-d), the
most homogeneous datasets, bib and uniprot, achieve very high CR (close to
100%) and steadily maintain it with varying graph sizes. As heterogeneity
grows significantly for shop and social, the CR becomes highly sensitive to
the dataset size, starting with low values for smaller graphs and stabilizing
between 85% and 90% for larger ones. Notice also that the most heterogeneous
datasets, shop and social, although similar, display a symmetric behavior for the
vertex and edge CRs: the former better compresses vertices, while the latter,
edges. Concerning the SCT runtime in Fig. 8(e-f), all datasets maintain reasonable
performance for larger sizes, even the most heterogeneous one, shop. The runtime
is, in fact, not affected by heterogeneity, but is rather sensitive, for larger sizes,
to |E| variations (up to 450K and 773K edges, for social and uniprot). Also, while
the source and target merge SCT runtimes are similar, the latter achieves better
CRs for social. Overall, the dataset with the worst CR for the two heuristics is
shop, with the lowest CR for smaller sizes. This is also due to the high number of
labels in the initial shop instances, and, hence, to the high number of properties
its summary needs: on average, for all considered sizes, 62.33 properties, against
17.67, for social graph, 10.0, for bib, and 14.0, for uniprot. These experiments
Approximate Querying on Property Graphs 261

[Fig. 8: six panels: (a) CR Edges (s−m); (b) CR Edges (t−m); (c) CR Vertices (s−m); (d) CR Vertices (t−m); (e) SCT (s−m) [sec]; (f) SCT (t−m) [sec]. Datasets: shop, social, bib, uniprot. X-axis: graph sizes (# nodes); y-axes: CR (%) and execution time (sec).]

Fig. 8. CRs for vertices and edges, along with SCT runtime for various dataset sizes,
for both source-merge (a-c-e), and target-merge (b-d-f).

show that, despite its high complexity, our summarization provides high CRs
and low SCT runtimes, even for large, heterogeneous graphs.
Approximate Evaluation Accuracy. We assess the accuracy and efficiency
of our engine with the relative error and time gain measures, respectively. The
relative error (per query Qi) is 1 − min(Qi(G), QTi(Ĝ))/max(Qi(G), QTi(Ĝ)) (in
%), where Qi(G) computes (with PGX) the counting query Qi on the original
graph, and QTi(Ĝ) computes (with our engine) the translated query QTi on the
summary. The time gain is (tG − tĜ)/max(tG, tĜ) (in %), where tG and tĜ are the
query evaluation times of Qi on the original graph and on the summary, respectively.
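Both measures are straightforward to compute; a small sketch (the helper names are ours, not from the paper):

```python
def relative_error(count_g: float, count_summary: float) -> float:
    # 1 - min/max, in %; 0% means the summary reproduces the exact count
    return (1 - min(count_g, count_summary) / max(count_g, count_summary)) * 100

def time_gain(t_g: float, t_summary: float) -> float:
    # (t_G - t_Ghat) / max(t_G, t_Ghat), in %; positive when the summary is faster
    return (t_g - t_summary) / max(t_g, t_summary) * 100
```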
For the Disjunction, Kleene-plus, Kleene-star, Optional and Single Label
query types, we have generated workloads of different sizes, bound by the num-
ber of labels in each dataset. For the concatenation workloads, we considered
binary conjunctive queries (CQs) without disjunction, recursion, or optionality.
Note that, currently, our summaries do not support compositionality.
Figure 9(a) and (b) show the relative error and average time gain for the
Disjunction, Kleene-plus, Kleene-star, Optional and Single Label workloads. In
Fig. 9(a), we note that the avg. relative error is kept low in all cases and is bound
by 5.5%, for the Kleene-plus and Kleene-star workloads of the social dataset.
In all the other cases, including the Kleene-plus and Kleene-star workloads of
the shop dataset, the error is relatively small (near 0%). This confirms the effec-
tiveness of our graph summaries for approximate evaluation of graph queries. In
Fig. 9(b), we studied the efficiency of approximate evaluation on our summaries
by reporting the time gain (in %) compared with the query evaluation on the
original graphs for the four datasets. We notice a positive time gain (≥75%)
in most cases, but for disjunction. While the relative approximation error is
still advantageous for disjunction, disjunctive queries are time-consuming for
262 S. Dumbrava et al.

(a) Avg. Rel. Error/Workload (b) Avg. Time Gain/Workload

Fig. 9. Rel. Error (a), Time Gain (b) per Workload, per Dataset, 200K nodes.

ID  Query Body                                      Approx. Answer   Rel. Error (%)   Runtime (ms)
                                                    SumRDF    APP    SumRDF    APP    SumRDF     APP
Q1  (x0)-[:producer]->()<-[:paymentAccepted]-(x1)   75        76     1.32      0.00   136.30     38.2
Q2  (x0)-[:totalVotes]->()<-[:price]-(x1)           42.4      44     3.64      0.00   50.99      17
Q3  (x0)-[:jobTitle]->()<-[:keywords]-(x1)          226.7     221    2.51      0.18   463.85     12.8
Q4  (x0)<-[:title]-()-[:performedIn]->(x1)          19.5      20     2.50      0.00   831.72     8.8
Q5  (x0)-[:artist]->()<-[:employee]-(x1)            143.3     133    7.19      0.37   196.77     10.6
Q6  (x0)-[:follows]->()<-[:editor]-(x1)             524       528    0.38      0.48   1295.83    19

Fig. 10. Performance Comparison: SumRDF vs. APP (our approach): approx. eval. of
binary CQs, SELECT COUNT(*) MATCH Qi , on the summaries of a shop graph instance
(31K nodes, 56K edges); comparing estimated cardinality (no. of computed answers),
rel. error w.r.t the original graph results, and query runtime.

approximate evaluation on our summaries, especially for extremely heterogeneous
datasets, such as shop (having the most labels). This is due to the overhead
introduced by considering all possible connectivity combinations on the
disjunctive labels. The problem of scaling our method, without prohibitive accu-
racy loss, to queries involving multiple labels and further compositionality, e.g.,
Kleene-star over disjunctions [22], is challenging and falls under the scope of
future work.
Baseline for Approximate Query Evaluation Performance. The clos-
est system to ours is SumRDF [19] (see Sect. 6), which, however, operates on
a simpler edge-labeled model rather than on property graphs and is tailored for
estimating the results of conjunctive queries only. As a performance baseline, we
considered the shop dataset in gMark [5], simulating the WatDiv benchmark [1]
(also a benchmark in [19]). From this dataset with 31K nodes and 56K edges,
we generated the corresponding SumRDF and our summaries. We obtained a
better CR than SumRDF, with 2737 nodes vs. 3480 resources and 17430 edges
vs. 29621 triples. This comparison is, however, tentative, as our approach com-
presses vertices independently of the edges, while SumRDF returns triples. We
then considered the same CQ types as in Fig. 10. Comparing our approach vs.

SumRDF (see Fig. 10), we recorded an average relative error of estimation of only
0.15% vs. 2.5%, and an average query runtime of only 27.55 ms vs. 427.53
ms. As SumRDF does not support disjunctions, Kleene-star/plus queries and
optional queries, further comparisons were not possible.

6 Related Work

Preliminary work on approximate graph analytics in a distributed setting has
recently been pursued in [15]. That work focuses instead on a graph sparsification
technique and small samples, in order to approximate the results of specific graph
algorithms, such as PageRank and triangle counting on undirected graphs. In
contrast, our approach operates in a centralized setting and relies on query-
driven graph summarization for graph navigational queries with aggregates.
RDF graph summarization for cardinality estimation has been tackled in [19],
albeit for a less expressive data model than ours (plain RDF vs. property graphs).
They focus on Basic Graph Patterns (BGP), hence their considered query frag-
ment has limited overlap with ours. As shown in Sect. 5, our approximate eval-
uation is faster and more accurate on a common set of (non recursive) queries.
An algorithm for answering graph reachability queries, using graph simu-
lation based pattern matching, is given in [13], to construct query preserving
summaries. However, it does not consider property graphs or aggregates.
Aggregation-based graph summarization [16] is at the heart of previous
approaches, the most notable of which is SNAP [20]. This method is mainly
devoted to discovery-driven graph summarization of heterogeneous networks and
is unsuitable for approximate query evaluation.
More recently, Rudolf et al. [18] have introduced a graph summary suitable
for property graphs based on a set of input summarization rules. However, it does
not support the label-constrained reachability queries in this paper. Graph sum-
maries for answering subgraphs returned by keyword queries on large networks
are studied in [24]. Our query classes significantly differ from theirs.

7 Conclusion
Our paper focuses on a novel graph summarization method that is suitable for
property graph querying. As the underlying MinSummary decision problem is
NP-complete, this technique builds on a heuristic that compresses label
frequency information in the nodes of the graph summary. We show the practical
effectiveness of our approach, in terms of compression ratios, error rates and
query evaluation time. As future work, we plan to investigate the feasibility of
our graph summary for other query classes, such as those described in [22]. Also,
we aim to apply formal methods, as described in [6], to ascertain the correctness
of our approximation algorithm, with provably tight error bounds.

References
1. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF
data management systems. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796,
pp. 197–212. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_13
2. Angles, R., et al.: G-CORE: a core for future graph query languages. In: SIGMOD,
pp. 1421–1432 (2018)
3. Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J.L., Vrgoc, D.: Founda-
tions of modern query languages for graph databases. ACM Comput. Surv. 50(5),
68:1–68:40 (2017)
4. Arenas, M., Conca, S., Pérez, J.: Counting beyond a Yottabyte, or how SPARQL
1.1 property paths will prevent adoption of the standard. In: WWW, pp. 629–638
(2012)
5. Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G.H.L., Lemay, A., Advokaat, N.:
gMark: schema-driven generation of graphs and queries. IEEE Trans. Knowl. Data
Eng. 29(4), 856–869 (2017)
6. Bonifati, A., Dumbrava, S., Arias, E.J.G.: Certified graph view maintenance with
regular datalog. TPLP 18(3–4), 372–389 (2018)
7. Bonifati, A., Fletcher, G., Voigt, H., Yakovets, N.: Querying Graphs. Synthesis
Lectures on Data Management. Morgan & Claypool Publishers (2018)
8. Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query
logs. PVLDB 11(2), 149–161 (2017)
9. Bonifati, A., Martens, W., Timm, T.: Navigating the maze of Wikidata query logs.
In: WWW, pp. 127–138 (2019)
10. Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: Rewriting of regular
expressions and regular path queries. J. Comput. Syst. Sci. 64(3), 443–465 (2002)
11. Cruz, I.F., Mendelzon, A.O., Wood, P.T.: A graphical query language supporting
recursion. In: SIGMOD, pp. 323–330 (1987)
12. Erling, O., et al.: The LDBC social network benchmark: interactive workload. In:
SIGMOD, pp. 619–630 (2015)
13. Fan, W., Li, J., Wang, X., Wu, Y.: Query preserving graph compression. In: SIG-
MOD, pp. 157–168 (2012)
14. Hernández, D., Hogan, A., Riveros, C., Rojas, C., Zerega, E.: Querying Wikidata:
comparing SPARQL, relational and graph databases. In: Groth, P., et al. (eds.)
ISWC 2016. LNCS, vol. 9982, pp. 88–103. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46547-0_10
15. Iyer, A.P., et al.: Bridging the GAP: towards approximate graph analytics. In:
GRADES, pp. 10:1–10:5 (2018)
16. Khan, A., Bhowmick, S.S., Bonchi, F.: Summarizing static and dynamic big graphs.
PVLDB 10(12), 1981–1984 (2017)
17. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the
most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph.
In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 376–394. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_23
18. Rudolf, M., Voigt, H., Bornhövd, C., Lehner, W.: SynopSys: foundations for multi-
dimensional graph analytics. In: Castellanos, M., Dayal, U., Pedersen, T.B., Tatbul,
N. (eds.) BIRTE 2013-2014. LNBIP, vol. 206, pp. 159–166. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46839-5_11

19. Stefanoni, G., Motik, B., Kostylev, E.V.: Estimating the cardinality of conjunctive
queries over RDF data using graph summarisation. In: WWW, pp. 1043–1052
(2018)
20. Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summariza-
tion. In: SIGMOD, pp. 567–580. ACM (2008)
21. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J.
Comput. 8(3), 410–421 (1979)
22. Valstar, L.D.J., Fletcher, G.H.L., Yoshida, Y.: Landmark indexing for evaluation
of label-constrained reachability queries. In: SIGMOD, pp. 345–358 (2017)
23. Wood, P.T.: Query languages for graph databases. SIGMOD Rec. 41(1), 50–60
(2012)
24. Wu, Y., Yang, S., Srivatsa, M., Iyengar, A., Yan, X.: Summarizing answer graphs
induced by keyword queries. PVLDB 6(14), 1774–1785 (2013)
Learning from Imprecise Data: Adjustments of Optimistic and Pessimistic Variants

Eyke Hüllermeier¹, Sébastien Destercke²(B), and Ines Couso³

¹ Heinz Nixdorf Institute and Department of Computer Science, Intelligent Systems
and Machine Learning Group, Paderborn University, Paderborn, Germany
[email protected]
² UMR CNRS 7253 Heudiasyc, Sorbonne Universités, Université de Technologie de
Compiègne, Compiègne, France
[email protected]
³ Department of Statistics and Operations Research,
University of Oviedo, Oviedo, Spain
[email protected]

Abstract. The problem of learning from imprecise data has recently


attracted increasing attention, and various methods to tackle this prob-
lem have been proposed. In this paper, we discuss and compare two quite
opposite approaches, an “optimistic” one that interprets imprecise data
in a way that is most favourable for a candidate model, and a “pes-
simistic” one in which model choice is guided by the most unfavourable
interpretation. To avoid an overly extreme behaviour, a modified version
of the latter has recently been proposed, which we complement by an
adjusted version of the optimistic approach. By presenting the various
methods within a common (loss minimization) framework and discussing
illustrative examples, we hope to provide some insight into important
properties and differences, thereby paving the way for a more formal
analysis.

1 Introduction
Superset learning is a specific type of learning from weak supervision, in which
the outcome (response) associated with a training instance is only characterized
in terms of a set of possible candidates. There are numerous applications in which
supervision is partial in that sense [9]. Correspondingly, the superset learning
problem has received increasing attention in recent years, and has been studied
under various names, such as learning from ambiguously labelled examples or
learning from partial labels [2,10]. The contributions so far also differ with regard
to their assumptions on the incomplete information being provided, and how it
has been produced. In this paper, we only assume the actual outcome to be
covered by the subset—hence the name superset learning.
In spite of the ambiguous, set-valued training data, the goal that is commonly
considered in superset learning is to induce a unique model, or a set of models
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 266–279, 2019.
https://doi.org/10.1007/978-3-030-35514-2_20
Optimistic and Pessimistic Learning from Imprecise Data 267

that are all deemed optimal (in the sense of fitting the observed data equally
well) and not differentiated any further. This differs from approaches that allow
for a set of incomparable, undominated models, resulting for instance from the
interval order induced by set-valued loss functions [3], or by the application of
conservative, imprecise Bayesian updating rules [11].
In this paper, we reconsider the principle of generalized loss minimization
based on the so-called optimistic superset loss (OSL) as introduced in [7]. To
better understand its nature and possible deficiencies, we contrast the latter
with another, in a sense diametral approach based on a “pessimistic” inference
principle. Moreover, to compensate for a bias that might be caused by an overly
optimistic attitude, we propose an adjustment of the OSL, which can be seen as
a counterpart of a corresponding modification of the pessimistic approach [6].
Presenting the various methods within a common framework of loss minimization
in supervised learning allows us to highlight some important properties and
differences through illustrative examples.

2 Preliminaries
2.1 Setting and Notation
The OSL was introduced in a standard setting of supervised learning with an
input (instance) space X and an output space Y. The goal is to learn a mapping
from X to Y that captures, in one way or the other, the dependence of outputs
(responses) on inputs (predictors). The learning problem essentially consists of
choosing an optimal model (hypothesis) h∗ from a given model space (hypothesis
space) H, based on a set of training data
D = {(xn, yn)}_{n=1}^{N} ∈ (X × Y)^N.    (1)

More specifically, optimality typically refers to optimal prediction accuracy, i.e.,


a model is sought whose expected prediction loss or risk

 
R(h) = ∫ L(y, h(x)) dP(x, y)    (2)

is minimal; here, L : Y × Y −→ R is a loss function, and P is an (unknown)


probability measure on X × Y modeling the underlying data generating process.
In the following, we assume hypotheses to be uniquely defined in terms of a
parameter θ from an underlying parameter space Θ: H = {hθ | θ ∈ Θ}, where
hθ is the hypothesis associated with θ. Selecting an optimal hypothesis h∗ ∈ H
thus reduces to estimating an optimal parameter θ∗ ∈ Θ.
We are interested in the case where parts of the data are not observed
precisely. More specifically, focusing on the output values¹ yn ∈ Y, we assume that
only supersets Yn ⊆ Y are observed. Thus, the learning algorithm does not
¹ The principles of optimistic (and likewise pessimistic) loss minimization also extend
to the case of imprecision in the instance features.
268 E. Hüllermeier et al.

have direct access to the (precise) data (1), but only to the (imprecise, coarse,
ambiguous) observations
O = {(xn, Yn)}_{n=1}^{N} ∈ (X × 2^Y)^N.    (3)

In the following, we denote by Y = Y1 ×Y2 ×· · ·×YN the (Cartesian) product


of the supersets observed for x1 , . . . , xN . Moreover, each y = (y1 , . . . , yN ) ∈ Y
is called an instantiation of the imprecisely observed data. More generally, we
call a sample D in (1) an instantiation of O if the instances xn coincide and
yn ∈ Yn for all n ∈ [N] := {1, . . . , N}.
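For small N, the set of instantiations can be enumerated explicitly as the Cartesian product of the observed supersets. A toy sketch in Python (the binary candidate sets are made up for illustration):

```python
from itertools import product

# Supersets Y_1, Y_2, Y_3 observed for three instances (lists, to fix the order)
supersets = [[1], [0, 1], [0]]

# Every y = (y_1, ..., y_N) with y_n in Y_n is an instantiation
instantiations = list(product(*supersets))
# -> [(1, 0, 0), (1, 1, 0)]
```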

2.2 Optimistic and Pessimistic Learning

According to [7], a candidate θ ∈ Θ is evaluated optimistically in terms of

R_emp^OPT(θ) := (1/N) min_{y∈Y} Σ_{n=1}^{N} L(yn, hθ(xn)),    (4)

i.e., in terms of the empirical risk of hθ in the case of a most favourable selection of
the outcomes yn . Moreover, given a loss L that is decomposable (over examples),
the “optimism” can be moved into the loss:

θ∗ := argmin_{θ∈Θ} R_emp^OPT(θ) = argmin_{θ∈Θ} (1/N) Σ_{n=1}^{N} LO(Yn, hθ(xn)),    (5)

with the optimistic superset loss (OSL)


 
LO(Y, ŷ) = min{ L(y, ŷ) | y ∈ Y },    (6)

which compares (precise) predictions with set-valued observations. A key
motivation of the OSL is the idea of data disambiguation, i.e., the idea of
simultaneously inducing the true model (parameter θ) and reconstructing the values of
the underlying precise data.
A completely opposite principle is to replace the optimistic minimum in (4)
by a pessimistic maximum [5]. More specifically, this principle was introduced
in the realm of statistical inference (instead of supervised learning) with L the
logistic loss, i.e., in the setting of maximum likelihood inference. The idea is to
evaluate each candidate θ in terms of the worst likelihood it can achieve over all
instantiations y ∈ Y, and to pick the best among these pessimistic evaluations.
Expressed in terms of generic loss functions (possibly but not necessarily the
logistic loss), this principle would amount to considering

R_emp^PESS(θ) := (1/N) max_{y∈Y} Σ_{n=1}^{N} L(yn, hθ(xn)),    (7)

and (again assuming the loss to be decomposable) choosing

θ∗ := argmin_{θ∈Θ} R_emp^PESS(θ) = argmin_{θ∈Θ} (1/N) Σ_{n=1}^{N} LP(Yn, hθ(xn))    (8)

as a presumably best model, with the pessimistic superset loss (PSL)


 
LP(Y, ŷ) = max{ L(y, ŷ) | y ∈ Y }.    (9)
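Since the losses are decomposable, OPT and PESS differ only in whether each set-valued observation contributes its best-case or worst-case loss. A generic sketch (the function names are ours):

```python
def osl(loss, Y, y_hat):
    # optimistic superset loss, Eq. (6): most favourable candidate in Y
    return min(loss(y, y_hat) for y in Y)

def psl(loss, Y, y_hat):
    # pessimistic superset loss, Eq. (9): least favourable candidate in Y
    return max(loss(y, y_hat) for y in Y)

def empirical_risk(superset_loss, loss, observations, h):
    # observations: pairs (x_n, Y_n), with Y_n a set of candidate outcomes
    return sum(superset_loss(loss, Y, h(x)) for x, Y in observations) / len(observations)
```

Minimizing `empirical_risk` with `osl` yields the θ∗ of Eq. (5); with `psl`, that of Eq. (8).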

3 Illustrative Examples
Which of the two approaches to superset learning is more reasonable, the opti-
mistic or the pessimistic one? This question is difficult (or actually impossible)
to answer without further assumptions on the coarsening process, i.e., the pro-
cess that turns precise data into imprecise observations. In the following, to get
a better idea of the nature of the two approaches, we illustrate them by some
simple examples. We shall refer to the optimistic approach (based on the OSL)
as OPT and to the pessimistic one (based on the PSL) as PESS.

3.1 Linear Regression

In linear regression, X = Rd , Y = R, and the goal is to learn a linear predictor


h(x) = x⊤θ = ⟨x, θ⟩. Training data is typically assumed to be noisy observations
yn = xn⊤θ0 + ε, where θ0 is the ground-truth parameter and ε a noise term
(with zero expectation). Correspondingly, in the setting of superset learning, we
assume observations Yn ∋ yn. Note, therefore, that Yn does not necessarily cover
the ideal outcome (e.g., the expected value E(y | xn) = xn⊤θ0); instead, just like
the precise observation yn itself, it might be shifted by the noise.
To evaluate predictions ŷ = h(x), the loss function most commonly used in
linear regression is the squared error loss. For the case of interval-valued data
Y = [ymin, ymax], the OSL (6) is then given as follows (cf. Fig. 1):

LO([ymin, ymax], ŷ) =  (ymin − ŷ)²   if ŷ < ymin
                       0             if ymin ≤ ŷ ≤ ymax        (10)
                       (ŷ − ymax)²   if ymax < ŷ

Thus, the loss is 0 if the prediction is inside the interval, i.e., if the regression
function intersects with the interval, and grows quadratically with the distance
from the interval outside. A small one-dimensional example of a set of interval-
valued data together with a regression line minimizing (5) is shown in Fig. 2
(left).
The PSL version (9) of the squared error loss is given as follows (cf. Fig. 1):

LP([ymin, ymax], ŷ) =  (ymax − ŷ)²   if ŷ < (ymin + ymax)/2
                       (ŷ − ymin)²   if ŷ ≥ (ymin + ymax)/2        (11)

Fig. 1. The OSL (solid line in blue) and PSL (dashed line in red) as extensions of the
squared error loss (gray line) in the case of an interval-valued observation (here the
interval [−1, 1], indicated by the vertical lines). (Color figure online)

As can be seen in Fig. 1, the PSL targets the midpoint of the interval as an opti-
mal “compromise value”; this point minimizes the maximal prediction error pos-
sible, and hence the loss function. Moreover, the larger the interval, the stronger
the loss function increases. Therefore, PESS is very similar to weighted linear
regression, where the weight of an example increases with the width of the cor-
responding interval. The OSL behaves in a quite different way: the larger the
interval, the smaller the loss function. Moreover, OSL does not prefer any values
inside the interval (e.g., the midpoint) to any other values. Note that, if the data
is completely coherent with a (noise-free) linear model, i.e., if there is a regres-
sion function intersecting all intervals, then any such function will be optimal
for OPT, while this is not necessarily the case for PESS, as PESS may prefer
a function not intersecting all intervals (see Fig. 2 (right) for an illustration).
Obviously, since the OSL is no longer strictly convex (in contrast with PSL), the
optimisation problem solved by OPT may no longer have a unique solution.
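Equations (10) and (11) can be transcribed directly (a sketch, not the authors' implementation):

```python
def osl_interval(y_min, y_max, y_hat):
    # Eq. (10): zero inside the interval, squared distance to the nearest endpoint outside
    if y_hat < y_min:
        return (y_min - y_hat) ** 2
    if y_hat > y_max:
        return (y_hat - y_max) ** 2
    return 0.0

def psl_interval(y_min, y_max, y_hat):
    # Eq. (11): squared distance to the farthest endpoint; minimized at the midpoint
    if y_hat < (y_min + y_max) / 2:
        return (y_max - y_hat) ** 2
    return (y_hat - y_min) ** 2
```

For the interval [−1, 1] of Fig. 1, `osl_interval` is 0 on the whole interval, whereas `psl_interval` equals 1 at the midpoint and grows towards the endpoints.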

Fig. 2. Left: Linear regression with interval-valued data. Right: Comparison between
PESS and OPT for linear regression.

We can also compare OPT and PESS from the point of view of model updating
or revision in the case where new data is observed. Imagine, for example, that

a new data point (xN +1 , YN +1 ) is added to the data seen so far. OPT will
check for how compatible its current model is with the interval YN +1 and make
adjustments only if necessary. In particular, if ŷN +1 = hθ (xN +1 ) ∈ YN +1 , i.e.,
the interval includes the current prediction, the model will not be changed at all,
as it is considered fully coherent with the new observation. This also implies that
an extremely wide interval will be ignored as being completely uninformative.
PESS, on the other side, will always change its current estimate θ, unless ŷN +1 =
hθ (xN +1 ) corresponds exactly to the midpoint of YN +1 ; this is because any
deviation from this “perfect” prediction is considered as a mistake (or at least a
suboptimal choice) that ought to be mitigated.
From the above comments, it is clear that the two strategies may behave
quite differently on the same data. OPT assumes that Yn is a set of candidate
values, one of which corresponds to the true measurement. Therefore, fitting
one of these candidates, namely the one that is maximally coherent with the
model assumption and the rest of the data, is enough. As opposed to this, PESS
seeks to fit all values yn ∈ Yn simultaneously, i.e., to find a good compromise
prediction ŷn that is not in conflict with any of the candidates.
It appears that OPT proceeds from a disjunctive interpretation of the set
Yn , and considers that the true data will not be chosen so as to systematically
put the assumed model in default. In contrast, PESS is more in line with a
conjunctive interpretation, which makes sense if all the candidates are indeed
guaranteed to be possible measurements. One could imagine, for example, that
xn actually characterizes a whole set of entities, and that Yn is the collection of
outputs associated with these entities. As an illustration, suppose we would like
to learn a control rule that prescribes an autonomous car the strength of braking
depending on its current speed x. Since the optimal strength will also depend
on other factors (such as weather conditions), which are ignored (or “integrated
out”) here, training examples might be interval-valued. For example, depending
on further unknown conditions, the optimal strength could be in-between ymin
and ymax for a speed of x Km/h. Adopting a “cautious” model, which minimizes
the worst mistake it can make, may look like a reasonable strategy then.

3.2 Logistic Regression

In logistic regression, the goal is to learn a probabilistic classifier


hθ(x) = 1 / (1 + exp(−⟨θ, x⟩)),    (12)

where hθ (x) is an estimate of the (conditional) probability p(y = 1 | x) of the


positive class. Inference is done on the basis of the maximum likelihood principle,
which is equivalent to minimizing the log-loss on the training data:


θ∗ = argmin_{θ∈Θ} Σ_{n=1}^{N} L(yn, hθ(xn))

with

L(y, p) = − log(py + (1 − p)(1 − y)) =  − log(p)       if y = 1
                                        − log(1 − p)   if y = 0

Using the representation (12) for the probability p, and the class encoding Y =
{−1, +1} instead of Y = {0, 1}, the loss can also be written as follows:

L(y, s) = log(1 + exp(−ys)),

where s = ⟨θ, x⟩ is the predicted score and ys is the margin, i.e., the distance
from the decision boundary (to the right side) (Fig. 3).

Fig. 3. OSL (blue, solid line) and PSL (red, dashed line) for the logistic loss function.
(Color figure online)

Since Y = {−1, +1} contains only two elements, there is only one imprecise
observation that can be made, namely Y = {−1, +1} = Y, and the setting
reduces to so-called semi-supervised learning (with a part of the data being
precisely labeled, and another part without any supervision). Thus, the OSL is
given by

LO(Y, s) =  L(−1, s)                   if Y = {−1}
            L(+1, s)                   if Y = {+1}
            min{L(−1, s), L(+1, s)}    if Y = {−1, +1},

and the pessimistic version LP by the same expression with min in the third case
replaced by max. As a consequence, if an imprecise observation is made, OPT
will try to disambiguate, i.e., to choose θ such that ys = y⟨θ, x⟩ is large (and
hence p is close to 0 or close to 1); this is in line with a large margin approach,
i.e., the learner tries to move the decision boundary away from the data points.
Indeed, the generalized loss LO can be seen as the logistic version of the “hat
loss” that is used in semi-supervised learning of support vector machines [1].
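For this binary case, the logistic OSL and PSL in margin form can be sketched as follows (an unlabeled observation corresponds to Y = {−1, +1}):

```python
import math

def log_loss(y, s):
    # logistic loss in margin form: log(1 + exp(-y * s))
    return math.log1p(math.exp(-y * s))

def osl_logistic(Y, s):
    return min(log_loss(y, s) for y in Y)

def psl_logistic(Y, s):
    return max(log_loss(y, s) for y in Y)

# For an unlabeled point, OPT rewards a large margin |s|, while PESS penalizes it:
unlabeled = {-1, +1}
# osl_logistic(unlabeled, 2.0) ~ 0.127, psl_logistic(unlabeled, 2.0) ~ 2.127
```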
As opposed to this, PESS will try to choose θ such that s ≈ 0 and hence
p ≈ 1/2. Obviously, this may lead to drastically different solutions. An example
is shown in Fig. 4, where a few labeled training examples are given (positive

in blue and negative in red) and many unlabeled. OPT seeks to maximize the
margin of the decision boundary, and hence puts it in-between the two clusters.
This is in line with the goal of disambiguation: ideally, the unlabeled examples
are far from the decision boundary, which means they are clearly identified as
positive or negative. PESS is doing exactly the opposite and tries to have the
unlabeled examples close to the decision boundary.

Fig. 4. Logistic regression in a semi-supervised setting: Solutions for OPT and PESS.
(Color figure online)

This example suggests that PESS is not really appropriate for tackling dis-
criminative learning tasks. To be fair, however, one has to acknowledge that
PESS may produce more reasonable results in other scenarios. For example, if
the unlabeled examples are not chosen arbitrarily but indeed correspond to those
cases that are very close to the true decision boundary, i.e., for which the
posterior probability is indeed close to 1/2, and which could hence be hard to label,
then PESS is just doing the right thing.
As another rather extreme example, suppose that the precise observations
in Fig. 4 are just the “noisy” cases, whereas all “normal” cases are hidden (the
blue class is actually in the upper right and the red class in the lower left). One
can imagine, for example, an “adversarial” coarsening process that coarsens all
normal cases and only reveals the noise in the data. In this scenario, it is clear
that OPT will be completely misled and produce exactly the opposite of the
right model. In such adversarial settings [8], PESS (and more generally minimax
approaches) may indeed be considered a more reasonable strategy, as it may
provide some guarantees in terms of protection with regard to the coarsening
process. Anyway, what all these examples are showing is that the reasonableness
of an approach strongly depends on which assumptions about the coarsening
process can be considered as plausible.

3.3 Statistical Parameter Estimation


As already said, OPT and PESS have been introduced in different contexts.
While generalized loss minimization with the OSL was mainly motivated by

problems of supervised machine learning, PESS has mostly been considered in a


setting of statistical parameter estimation, such as the estimation of the param-
eter θ of a Bernoulli distribution in coin tossing. In these cases, OPT may tend
to produce rather extreme estimates. For example, consider a sample such as

1, 0, ?, 0, ?, 1, 1, 1, ?, ? ,

with p positive outcomes indicated by a 1 (e.g., a coin toss landing heads up),
n negative outcomes indicated by a 0, and u unknowns indicated by a ?. One
can check that, in the case where p > n, OPT will produce the estimate θ∗ =
(p + u)/(p + u + n), based on a corresponding disambiguation in which each unknown

is replaced by a positive outcome. More generally, in a multinomial case, all


unknowns are supposed to belong to the majority of the precise part of the data.
This estimate maximizes the likelihood or, equivalently, minimizes the log-loss


L(θ) = − Σ_{n=1}^{N} [ Xn log(θ) + (1 − Xn) log(1 − θ) ].

Such an estimate may appear somewhat implausible. Why should all the
unknowns be positive? Of course, one may not exclude that the coarsening pro-
cess is such that only positives are hidden. In that case, OPT will exactly do the
right thing. Still, the estimate remains rather extreme and hence arguable.
In contrast, PESS would try to maximize the entropy of the estimated dis-
tribution [4, Corollary 1], which is equivalent to having θ∗ = 1/2 in the example
given above. While such an estimate may seem less extreme and more reason-
able, there is again no compelling reason to consider it more (or less) legitimate
than the one obtained by OPT, unless further assumptions are made about
the coarsening process. Finally, note that neither OPT nor PESS can produce
the estimate obtained by the classical coarsening-at-random (CAR) assumption,
which would give θ∗ = 2/3.
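As an editorial illustration (not part of the original paper), the three estimates discussed in this example are easy to compute on the sample above; the function names are ours, and the PESS value is the maximum-entropy estimate valid for this particular example:

```python
from fractions import Fraction

def opt_estimate(p, n, u):
    # OPT: all unknowns are disambiguated in favour of the majority
    # outcome of the precise part (here p > n, so as positives).
    return Fraction(p + u, p + u + n)

def pess_estimate():
    # PESS: the maximum-entropy value, which is 1/2 in this example.
    return Fraction(1, 2)

def car_estimate(p, n):
    # CAR: coarsening at random, i.e. simply ignore the unknowns.
    return Fraction(p, p + n)

# Sample 1, 0, ?, 0, ?, 1, 1, 1, ?, ?  ->  p = 4, n = 2, u = 4
print(opt_estimate(4, 2, 4))   # 4/5
print(pess_estimate())         # 1/2
print(car_estimate(4, 2))      # 2/3
```

With p = 4, n = 2 and u = 4, the three strategies indeed give the values 4/5 = (p+u)/(p+u+n), 1/2, and 2/3 quoted in the text.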
As a first remark, let us repeat that generalized loss minimization based
on OSL was actually not intended, or at least not motivated, by this sort of
problem. To explain this point, let us compare the above (statistical estimation)
example of coin tossing with the previous (machine learning) example of logistic
regression. In fact, the former can be seen as a special case of the latter, with
an instance space X = {x0 } consisting of a single instance, such that θ = p(y =
1 | x0 ). Correspondingly, since X has no structure, it is impossible to leverage
any structural assumptions about the sought model h : X −→ Y, which is the
basis of the idea of data disambiguation as performed by OPT.
In particular, in the case of coin flipping, each ? can be replaced by any (hypo-
thetical) outcome, independently of all others and without violating any model
assumptions. In other words, every instantiation of the coarse data is as plausible
as any other. This is in sharp contrast with the case of logistic regression, where
the assumption of a linear model, i.e., the assumption that the probability of
success for an input x depends on the spatial position of that point, lets many
disambiguations appear implausible. For example, in Fig. 5, the instantiation in
the middle, where half of the unlabeled examples are disambiguated as positive
and the other half as negative, is clearly more coherent with the assumption of
(almost) linearly separable classes than the instantiation on the right, where all
unknowns are assigned to the positive class.

Optimistic and Pessimistic Learning from Imprecise Data 275

Fig. 5. Coarse data (left) together with two instantiations (middle and right).
In spite of this, examples like the one of coin tossing are indeed suggesting
that OSL might be overly optimistic in certain cases. Even in discriminative
learning, OSL makes the assumption that the chosen model class is the right
one, which may lead to overly confident results should the model choice be
wrong. This motivates a reconsideration of the optimistic inference principle
and perhaps a suitable adjustment.

4 Adjustments of OSL and PSL

A noticeable property of the previous coin tossing example is a bias of the estima-
tion (or learning) process, which is caused by the fact that a higher likelihood can
principally be achieved with a more extreme θ. For example, with θ ∈ {0, 1}, the
probability of an “ideal” sample is 1, whereas for θ = 1/2, the highest probability
achievable on a sample of size N is (1/2)^N . Thus, it seems that, from the very
beginning, the candidate estimate θ = 1/2 is put at a systematic disadvantage.
This can also be seen as follows: Consider any sample produced by θ = 1,
i.e., a sequence of tosses with heads up. When coarsening the data by covering
a subset of the sample, OPT will still produce θ = 1 as an estimate. Roughly
speaking, θ = 1 is “robust” toward coarsening. As opposed to this, when coars-
ening a sample produced with θ = 1/2, OPT will diverge and either produce a
smaller or a larger estimate.
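This bias can be made concrete with a small numeric sketch (ours, not the authors'): the highest log-likelihood achievable by a given θ on any sample of size N, which is also the penalty F(θ) of Eq. (16) below up to sign:

```python
import math

def best_loglik(theta, n):
    # Highest log-likelihood achievable by theta on any sample of n tosses:
    # each toss contributes the log of the more probable outcome.
    if theta in (0.0, 1.0):
        return 0.0  # an "ideal" sample has probability 1
    return n * math.log(max(theta, 1.0 - theta))

print(best_loglik(1.0, 10))             # 0.0
print(round(best_loglik(0.5, 10), 2))   # -6.93
```

The extreme estimate θ = 1 can reach log-likelihood 0, while θ = 1/2 can never do better than N log(1/2), which is the systematic disadvantage discussed above.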

4.1 Regularized OSL

One way to counter a systematic bias in disfavour of certain parameters or


hypotheses is to adopt a Bayesian approach. Instead of looking at the highest
likelihood value $\max_{\mathbf{y} \in \mathbf{Y}} p(\mathbf{y} \mid \theta)$ of θ across different instantiations of the
imprecise data², one may start with a prior π on θ and look at the highest
posterior³

$$\max_{\mathbf{y} \in \mathbf{Y}} \frac{p(\mathbf{y} \mid \theta)\,\pi(\theta)}{p(\mathbf{y})},$$

or, equivalently,

$$\max_{\mathbf{y} \in \mathbf{Y}} \big( \log p(\mathbf{y} \mid \theta) - H(\theta, \mathbf{y}) \big) = \max_{\mathbf{y} \in \mathbf{Y}} \Big( \sum_{n=1}^{N} \log p(y_n \mid \theta) - H(\theta, \mathbf{y}) \Big) \qquad (13)$$

with

$$H(\theta, \mathbf{y}) := \log p(\mathbf{y}) - \log \pi(\theta). \qquad (14)$$

At the level of loss minimization, when ignoring the role of y in (14), this
approach essentially comes down to adding a regularization term to the empirical
risk, and hence to minimizing the regularized OSL

$$R^{OPT}_{reg}(\theta) := \frac{1}{N} \sum_{n=1}^{N} L_O\big( Y_n , h_\theta(x_n) \big) + F(h_\theta), \qquad (15)$$

where F(h_θ) is a suitable penalty term.


Coming back to our original motivation, namely that some parameters can
principally achieve a higher likelihood than others, one instantiation of F one
may think of is the maximal (log-)likelihood conceivable for θ (where the sample
can be chosen freely and does not depend on the actual imprecise observations):
 
$$F(\theta) = - \max_{\mathbf{y} \in \mathcal{Y}^N} \log p(\mathbf{y} \mid \theta). \qquad (16)$$

In this case, F(θ) can again be moved inside the loss function L_O in (15):

$$R^{OPT}_{reg}(\theta) := \frac{1}{N} \sum_{n=1}^{N} \tilde{L}_O\big( Y_n , h_\theta(x_n) \big) \qquad (17)$$

with

$$\tilde{L}_O(Y, \hat{y}) := \min_{y \in Y} L(y, \hat{y}) - \min_{y \in \mathcal{Y}} L(y, \hat{y}). \qquad (18)$$

For some losses, such as squared error loss in regression, the adjustment (18)
has no effect, because L(y, ŷ) = 0 can always be achieved for at least one y ∈ Y.
For others, however, the adjusted loss $\tilde{L}_O$ may indeed differ from L_O. For the log-loss in binary

² We assume the x_n in the data {(x_n, y_n)}_{n=1}^N to be fixed.
³ The obtained bounds are similar to the upper expectation bound obtained by the
updating rule discussed by Zaffalon and Miranda [11] in the case of a completely
unknown coarsening process and precise prior information. However, Zaffalon and
Miranda discussed generic robust updating schemes leading to sets of probabilities
or sets of models, which is not the intent of the methods discussed in this paper.

Fig. 6. The adjusted OSL version (19) of the logistic loss (black line) compared to the
original version (red line). (Color figure online)

classification, for example, the normalizing term in (18) is min{L(0, p), L(1, p)},
which means that

$$\tilde{L}_O(Y, p) = \begin{cases} \log(1 - p) - \log(p) & \text{if } Y = \{1\},\ p < 1/2 \\ \log(p) - \log(1 - p) & \text{if } Y = \{0\},\ p > 1/2 \\ 0 & \text{otherwise} \end{cases} \qquad (19)$$

A graphical representation of this loss function, which can be seen as a combi-
nation of the 0/1 loss (it is 0 for signed probabilities ≥ 1/2) and the log-loss, is
shown in Fig. 6.
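To make the closed form (19) concrete, the following short Python sketch (ours, for illustration only) checks it against the generic adjustment (18) on a grid of probability values:

```python
import math

def log_loss(y, p):
    # Standard log-loss for a binary outcome y with predicted P(y = 1) = p.
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def adjusted(Y, p):
    # Generic adjustment (18): optimistic loss over the set-valued label Y,
    # minus the smallest loss achievable over the whole label space {0, 1}.
    return min(log_loss(y, p) for y in Y) - min(log_loss(y, p) for y in (0, 1))

def adjusted_closed_form(Y, p):
    # Closed form (19) for the binary case.
    if Y == {1} and p < 0.5:
        return math.log(1.0 - p) - math.log(p)
    if Y == {0} and p > 0.5:
        return math.log(p) - math.log(1.0 - p)
    return 0.0

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    for Y in ({0}, {1}, {0, 1}):
        assert abs(adjusted(Y, p) - adjusted_closed_form(Y, p)) < 1e-12
```

In particular, a set-valued label Y = {0, 1} always incurs zero adjusted loss, consistent with the 0/1-loss behaviour visible in Fig. 6.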

4.2 Adjustment of PSL: Min-Max Regret


Interestingly, a similar adjustment, called min-max regret criterion, has recently
been proposed for PESS [6]. The motivation of the latter, namely to assess a
parameter θ in a relative rather than absolute way, is quite similar to ours.
Adopting our notation, a candidate θ is evaluated in terms of

   
$$\max_{\mathbf{y} \in \mathbf{Y}} \Big( \log p(\mathbf{y} \mid \theta) - \max_{\hat{\theta}} \log p(\mathbf{y} \mid \hat{\theta}) \Big). \qquad (20)$$

That is, θ is assessed on a concrete instantiation y ∈ Y by comparing it to the


best estimation θ̂y on that data, which defines the regret, and then the worst
comparison over all possible instantiations (the maximum regret) is considered.
Like in the case of OSL, this can again be seen as an approximation of (14) with
 
$$F(\mathbf{y}) = \max_{\hat{\theta}} \log p(\mathbf{y} \mid \hat{\theta}),$$

which now depends on y but not on θ (whereas the F in (15) depends on θ but
not on y). Obviously, the min-max regret principle is less pessimistic than the
original PSL, and leads to an adjustment of PESS that is even somewhat compa-
rable to OPT: The loss of a candidate θ on an instantiation y is corrected by the


Fig. 7. Loss functions and optimal predictions of θ (minima of the losses indicated by
vertical lines) in the case of coin tossing with observations 0, 0, 1, 1, 1, 1, ?, ?, ?: solid
blue line for OSL, dashed blue for the regularized OSL version (14) with π the beta
(5,5) distribution, solid red for PSL, and dashed red for the adjusted PSL (20). (Color
figure online)

minimal loss F (y) that can be achieved on this instantiation. Obviously, by doing
so, the influence of instantiations that necessarily cause a high loss is reduced.
But these instantiations are exactly those that are considered as “implausible”
and down-weighted by OPT (cf. Sect. 3.3). See Fig. 7 for an illustrative compar-
ison in the case of coin tossing as discussed in Sect. 3.3. Note that (20) does not
permit an additive decomposition into losses on individual training examples,
because the regret is defined on the entire set of data. Instead, a generalization
of (20) to loss functions other than log-loss suggests evaluating each θ in terms
of the maximal regret
 
$$MReg(\theta) := \max_{\mathbf{y} \in \mathbf{Y}} \Big( R_{emp}(\theta, \mathbf{y}) - \min_{\hat{\theta}} R_{emp}(\hat{\theta}, \mathbf{y}) \Big), \qquad (21)$$

where Remp (θ, y) denotes the empirical risk of θ on the data obtained for the
instantiation y. Computing the maximal regret (21), let alone finding the mini-
mizer θ∗ = argminθ MReg(θ), appears to be intractable except for trivial cases.
In particular, the problem will be hard in cases like logistic regression, where
the empirical risk minimizer minθ̂ Remp (θ̂, y) cannot be obtained analytically,
because then even the evaluation of a single candidate θ on a single instantia-
tion y requires the solution of a complete learning task—not to mention that
the minimization over all instantiations y comes on top of this.

5 Concluding Remarks
The goal of our discussion was to provide some insight into the basic nature of
the “optimistic” and the “pessimistic” approach to learning from imprecise data.
To this end, we presented both of them in a unified framework and highlighted
important properties and differences through illustrative examples.

As future work, we plan a more thorough comparison going beyond anecdotal


evidence. Even if both approaches deliberately refrain from specific assumptions
about the coarsening process, it would be interesting to characterize situations
in which they are likely to produce accurate results, perhaps even with formal
guarantees, and situations in which they may fail. In addition to a formal analysis
of that kind, it would also be interesting to compare the approaches empirically.
This is not an easy task, however, especially due to a lack of suitable (real)
benchmark data. Synthetic data can of course be used as well, but as our exam-
ples have shown, it is always possible to create the data in favour of the one and
in disfavour of the other approach.

References
1. Chapelle, O., Sindhwani, V., Keerthi, S.S.: Optimization techniques for semi-
supervised support vector machines. J. Mach. Learn. Res. 9, 203–233 (2008)
2. Cour, T., Sapp, B., Taskar, B.: Learning from partial labels. J. Mach. Learn. Res.
12, 1501–1536 (2011)
3. Couso, I., Sánchez, L.: Machine learning models, epistemic set-valued data and
generalized loss functions: an encompassing approach. Inf. Sci. 358, 129–150 (2016)
4. Guillaume, R., Couso, I., Dubois, D.: Maximum likelihood with coarse data based
on robust optimisation. In: Proceedings of the Tenth International Symposium on
Imprecise Probability: Theories and Applications, pp. 169–180 (2017)
5. Guillaume, R., Dubois, D.: Robust parameter estimation of density functions under
fuzzy interval observations. In: 9th International Symposium on Imprecise Proba-
bility: Theories and Applications (ISIPTA 2015), pp. 147–156 (2015)
6. Guillaume, R., Dubois, D.: A maximum likelihood approach to inference under
coarse data based on minimax regret. In: Destercke, S., Denoeux, T., Gil, M.Á.,
Grzegorzewski, P., Hryniewicz, O. (eds.) SMPS 2018. AISC, vol. 832, pp. 99–106.
Springer, Cham (2019). https://doi.org/10.1007/978-3-319-97547-4 14
7. Hüllermeier, E.: Learning from imprecise and fuzzy observations: data disambigua-
tion through generalized loss minimization. Int. J. Approxim. Reasoning 55(7),
1519–1534 (2014)
8. Laskov, P., Lippmann, R.: Machine learning in adversarial environments. Mach.
Learn. 81(2), 115–119 (2010)
9. Liu, L.P., Dietterich, T.G.: A conditional multinomial mixture model for superset
label learning. In: Proceedings NIPS (2012)
10. Nguyen, N., Caruana, R.: Classification with partial labels. In: Proceedings of the
14th International Conference on Knowledge Discovery and Data Mining (KDD),
Las Vegas, USA (2008)
11. Zaffalon, M., Miranda, E.: Conservative inference rule for uncertain reasoning
under incompleteness. J. Artif. Intell. Res. 34, 757–821 (2009)
On Cautiousness and Expressiveness
in Interval-Valued Logic

Sébastien Destercke(B) and Sylvain Lagrue

Université de Technologie de Compiègne, CNRS, UMR 7253 - Heudiasyc,


Centre de Recherche de Royallieu, Compiègne, France
{sebastien.destercke,sylvain.lagrue}@hds.utc.fr

Abstract. In this paper, we study how cautious conclusions should
be taken when considering interval-valued propositional logic, that is,
a logic where each formula is associated with a real-valued interval
providing imprecise information about the penalty incurred for falsifying
this formula. We work under the general assumption that the weights of
falsified formulas are aggregated through a non-decreasing commutative
function, and that an interpretation is all the more plausible as it is less
penalized. We then formulate some dominance notions, as well as properties
that such notions should follow if we want to draw conclusions that are at
the same time informative and cautious. We then discuss the dominance
notions in light of such properties.

Keywords: Logic · Imprecise weights · Skeptic inference · Robust
inferences · Penalty logic

1 Introduction
Logical frameworks have always played an important role in artificial intelli-
gence, and adding weights to logical formulas allow one to deal with a variety of
problems with which classical logic struggles [3].
Usually, such weights are assumed to be precisely given, and associated to
an aggregation function, such as the maximum in possibilistic logic [4] or the
sum in penalty logic [5]. These approaches can typically find applications in
non-monotonic reasoning [1] or preference handling [7].
However, as providing specific weights to each formula is likely to be a
cognitively demanding task, many authors have considered extensions of these
frameworks to interval-valued weights [2,6], where intervals are assumed to con-
tain the true, ill-known weights. Such approaches can also be used, for instance,
to check how robust conclusions obtained with precise weights are.
In this paper, we are interested in making cautious or robust inferences in
such interval-valued frameworks. That is, we look for inference tools that will
typically result in a partial order over the interpretations or world states, such
that any preference statement made by this partial order is made in a skeptic
way, i.e., it holds for any replacement of the weights by precise ones within the
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 280–288, 2019.
https://doi.org/10.1007/978-3-030-35514-2_21

intervals, and should not be reversed when gaining more information. We simply
assume that the weights are positive and aggregated by a quite generic function,
meaning that we include for instance possibilistic and penalty logics as special
cases.
We provide the necessary notations and basic material in Sect. 2. In Sect. 3,
we introduce different ways to obtain partial orders over interpretations, and
discuss different properties that corresponding cautious inference tools could or
should satisfy. Namely, that reducing the intervals will provide more informative
and non-contradictory inferences, and that if an interpretation falsify a subset
of formulas falsified by another one, then it should be at least as good as this
latter one. Section 4 shows which of the introduced inference tools satisfy which
property.

2 Preliminaries
We consider a finite propositional language L. We denote by Ω the space of all
interpretations of L, and by ω an element of Ω. Given a formula φ, ω is a model
of φ if it satisfies it, denoted ω |= φ.
A weighted formula is a tuple ⟨φ, α⟩, where α represents the importance
of the rule, and the penalty incurred if it is not satisfied. This weight may be
understood in various ways: as a degree of certainty, as a degree of importance of
an individual preference, etc. We assume that α takes its values in an interval
of ℝ⁺, possibly extended to include ∞ (e.g., to represent formulas that cannot
be falsified). In this paper, a formula with α = 0 is understood as a totally
unimportant formula that can be ignored, while a formula with maximal α is a
formula that must be satisfied.
A (precisely) weighted knowledge base K = {⟨φi , αi⟩ : i = 1, . . . , n} is a set of
distinct weighted formulas. Since these formulas are weighted, an interpretation
can (and sometimes must, if K without weights is inconsistent) falsify some of
them, and still be considered as valid. In order to determine an ordering between
different interpretations, we introduce two new notations:

– FK (ω) = {φi : ω ⊭ φi }, the set of formulas falsified by ω
– FK (ω \ ω′) = {φi : ω ⊭ φi ∧ ω′ ⊨ φi }, the set of formulas falsified by ω but
  satisfied by ω′

Let us furthermore consider an aggregation function ag : ℝⁿ → ℝ that we
assume to be non-decreasing, commutative, continuous and well-defined¹ for any
finite number n.
We consider that ag({αi : φi ∈ F }) applied to a subset F of formulas of K
measure the overall penalty corresponding to F , with ag(∅) = 0. Given this, we
also assume that if ag receives two vectors a and b of dimensions n and n + m
such that b has the same first n elements as a, i.e., b = (a, y1 , . . . , ym ), then
ag(a) ≤ ag(b). The idea here is that adding (falsified) formulas to a can only
increase the global penalty. Classical options correspond to possibilistic logic
(weights are in [0, 1] and ag = max) or penalty logic (weights are positive reals
and ag = Σ). Based on this aggregation function, we define, given K, the two
following complete orderings between interpretations when weights are precise:

– ω ⪯^{K}_{All} ω′ iff ag({αi : φi ∈ FK (ω)}) ≤ ag({αi : φi ∈ FK (ω′)}).
– ω ⪯^{K}_{Diff} ω′ iff ag({αi : φi ∈ FK (ω \ ω′)}) ≤ ag({αi : φi ∈ FK (ω′ \ ω)}).

¹ As we do not necessarily assume it to be associative.

Both orderings can be read as ω ⪯ ω′ meaning that “ω is more plausible, or
preferred to ω′, given K”.
When the weights are precise, it may be desirable for ⪯_{All} and ⪯_{Diff} to be
consistent, that is, not to have ω ≺^{K}_{All} ω′ and ω′ ≺^{K}_{Diff} ω for a given K. It may
be hard to characterize the exact family of functions ag that will satisfy this,
but we can show that adding associativity and strict increasingness² to the other
mentioned properties ensures that results will be consistent.

Proposition 1. If ag is continuous, commutative, strictly increasing and asso-
ciative, then given a knowledge base K, we have that

ω ⪯^{K}_{All} ω′ ⇔ ω ⪯^{K}_{Diff} ω′


Proof. Let us denote the sets {αi : φi ∈ FK (ω)} and {αi : φi ∈ FK (ω′)} as real-
valued vectors a = (x1 , . . . , xn , yn+1 , . . . , yna ) and b = (x1 , . . . , xn , zn+1 , . . . ,
znb ), where x1 , . . . , xn are the weights associated to the formulas that both
interpretations falsify. Showing the equivalence of Proposition 1 then comes down
to showing

ag(a) ≥ ag(b) ⇔ ag((yn+1 , . . . , yna )) ≥ ag((zn+1 , . . . , znb )).

Let us first remark that, due to associativity,

ag(a) = ag(ag((x1 , . . . , xn )), ag((yn+1 , . . . , yna ))) := ag(A, B),

ag(b) = ag(ag((x1 , . . . , xn )), ag((zn+1 , . . . , znb ))) := ag(A, C).


Under these notations, we must show that ag(A, B) ≥ ag(A, C) ⇔ B ≥ C.
That B ≥ C ⇒ ag(A, B) ≥ ag(A, C) is immediate, as ag is non-decreasing.
To show that B ≥ C ⇐ ag(A, B) ≥ ag(A, C), we can just see that if B < C, we
have ag(A, B) < ag(A, C) due to the strict increasingness of ag.
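As a small numeric check of Proposition 1 (our illustration, not part of the paper), the following Python snippet shows that the All and Diff orderings coincide for ag = sum, which is strictly increasing and associative, but can disagree for ag = max, which is not strictly increasing:

```python
def order_all(ag, f1, f2):
    # omega1 precedes omega2 under All: aggregate every falsified weight.
    return ag(list(f1.values())) <= ag(list(f2.values()))

def order_diff(ag, f1, f2):
    # omega1 precedes omega2 under Diff: aggregate non-shared formulas only.
    d1 = [w for k, w in f1.items() if k not in f2]
    d2 = [w for k, w in f2.items() if k not in f1]
    return ag(d1) <= ag(d2)

agg_sum = sum                                  # penalty-logic style, ag(empty) = 0
agg_max = lambda ws: max(ws, default=0)        # possibilistic style, ag(empty) = 0

# omega falsifies {phi1: 1, phi3: 5}; omega' falsifies {phi2: 2, phi3: 5}.
f, g = {"phi1": 1, "phi3": 5}, {"phi2": 2, "phi3": 5}

# ag = sum: All and Diff agree in both directions, as Proposition 1 states.
assert order_all(agg_sum, f, g) == order_diff(agg_sum, f, g)
assert order_all(agg_sum, g, f) == order_diff(agg_sum, g, f)

# ag = max: All ties the two interpretations (max weight 5 on both sides),
# while Diff strictly prefers omega (1 < 2), so the equivalence fails.
assert order_all(agg_max, g, f) and not order_diff(agg_max, g, f)
```

This matches the footnote remark that ag = max will not always satisfy the equivalence.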

3 Interval-Valued Logic, Dominance Notions and Properties
In practice, it is a strong requirement to ask users to provide precise weights for
each formula, and they may be more comfortable in providing imprecise ones.
This is one of the reason why researchers proposed to extend weighted logics to
interval-valued logics, where the knowledge base is assumed to have the form
K = {⟨φi , Ii⟩ : i = 1, . . . , n}, with Ii = [ai , bi ] representing an interval of possible
weights assigned to φi .

² Which is also necessary, as ag = max will not always satisfy Property 1.
In practice, this means that the result of applying ag to a set of formulas F
is no longer a precise value, but an interval $[\underline{ag}, \overline{ag}]$. As ag is a non-decreasing
continuous function, computing this interval is quite easy, as

$$\underline{ag} = ag(\{a_i \mid \phi_i \in F\}), \qquad \overline{ag} = ag(\{b_i \mid \phi_i \in F\}),$$


which means that if the problem with precise weights is easy to solve, then
solving it for interval-valued weights is equally easy, as it amounts to solve twice
the problems for specific precise weights (i.e., the lower and upper bounds).
A question is now to know how we should rank the various interpretations in
a cautious way given these interval-valued formulas. In particular, this means
that the resulting order between interpretations should be a partial order if we
have no way to know whether one has a higher score than the other, given our
imprecise information. But at the same time, we should try to not lose too much
information by making things imprecise.
There are two classical ways to compare interval-valued scores that results
in possible incomparabilities:

– Lattice ordering: [a, b] ⪯_L [c, d] iff a ≤ c and b ≤ d. We then have [a, b] ≺_L
  [c, d] if one of the two inequalities is strict, and [a, b] ≈_L [c, d] iff [a, b] = [c, d].
  Incomparability of [a, b] and [c, d] corresponds to one of the two sets being
  strictly included in the other.
– Strict ordering: [a, b] ⪯_S [c, d] iff b ≤ c. We then have [a, b] ≺_S [c, d] if b < c,
  and indifference will only happen when a = b = c = d (hence never if intervals
  are non-degenerate). Incomparability of [a, b] and [c, d] corresponds to the two
  sets overlapping.
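A minimal Python sketch of the two comparisons (ours, for illustration), with intervals encoded as pairs:

```python
def lattice_leq(i, j):
    # [a, b] precedes [c, d] in the lattice ordering iff a <= c and b <= d.
    return i[0] <= j[0] and i[1] <= j[1]

def strict_leq(i, j):
    # [a, b] precedes [c, d] in the strict ordering iff b <= c.
    return i[1] <= j[0]

# Strict comparability implies lattice comparability, but not conversely.
assert strict_leq((0, 1), (2, 3)) and lattice_leq((0, 1), (2, 3))
assert lattice_leq((0, 4), (1, 5)) and not strict_leq((0, 4), (1, 5))
# Nested intervals are lattice-incomparable; overlapping ones strict-incomparable.
assert not lattice_leq((1, 3), (0, 4)) and not lattice_leq((0, 4), (1, 3))
assert not strict_leq((0, 4), (1, 5)) and not strict_leq((1, 5), (0, 4))
```

The asserts make the "stronger than" remark concrete: every strict comparison is also a lattice comparison, while overlapping intervals such as [0, 4] and [1, 5] are lattice-comparable yet strict-incomparable.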

These two orderings can then be applied either to ⪯_{All} or ⪯_{Diff}, resulting in
four different extensions: ⪯_{All,L}, ⪯_{All,S}, ⪯_{Diff,L}, ⪯_{Diff,S}. A first remark is that
strict comparisons are stronger than lattice ones, as the former imply the latter,
that is, if [a, b] ⪯_S [c, d], then [a, b] ⪯_L [c, d]. In order to decide which of these
orderings are the most adequate, let us first propose some properties they should
follow when one wants to perform cautious inferences.

Property 1 (Informational monotonicity). Assume that we have two knowl-
edge bases K1 = {⟨φ1i , I1i⟩ : i = 1, . . . , n} and K2 = {⟨φ2i , I2i⟩ : i = 1, . . . , n}
with φ1i = φ2i and I1i ⊆ I2i for all i. An aggregation method and the partial order
it induces on interpretations is informational monotonic if

ω ≺^{K2} ω′ =⇒ ω ≺^{K1} ω′

That is, the more we gain information, the better we become at differenti-
ating and ranking interpretations. If ω is strictly preferred to ω′ before getting
more precise assessments, it should remain so after the assessments become more
precise³. A direct consequence of Property 1 is that we cannot have ω ≺^{K2} ω′
and ω′ ≺^{K1} ω, meaning that ⪯^{K1} will be a refinement of ⪯^{K2}. This makes
sense if we aim for a cautious behaviour, as the conclusions we make in terms of
preferred interpretations should be guaranteed, i.e., they should not be revised
when we become more precise.
It should also be noted that violating this property means that the corre-
sponding partial order is not skeptic in the sense advocated in the introduction,
as a conclusion taken at an earlier step can be contradicted later on by gaining
more information.

Property 2 (subset/implication monotonicity). Assume that we have a
knowledge base K. An aggregation method and the partial order it induces on
interpretations follows subset monotonicity if

FK (ω) ⊆ FK (ω′) =⇒ ω ⪯^{K} ω′ for any pair ω, ω′

This principle is quite intuitive: if we are sure that ω′ falsifies the same
formulas as ω, in addition to some others, then certainly ω′ should be less
preferable/certain than ω.

4 Discussing Dominance Notions

Let us now discuss the different partial orders in light of these properties, starting
with the lattice orderings and then proceeding to interval orderings.

4.1 Lattice Orderings

Let us first show that ⪯_{All,L}, ⪯_{Diff,L} do not satisfy Property 1 in general, by
considering the following example:

Example 1. Consider the case where ai , bi ∈ ℝ and ag = Σ, with the following
knowledge base on the propositional variables {p, q}

φ1 = p, φ2 = p ∧ q, φ3 = ¬q

with the three following sets (respectively denoted K1 , K2 , K3 ) of interval-valued
scores

I1^{K1} = [2.5, 2.5],   I2^{K1} = [0, 4],   I3^{K1} = [1, 5],
I1^{K2} = [2.5, 2.5],   I2^{K2} = [4, 4],   I3^{K2} = [1, 5],
I1^{K3} = [2.5, 2.5],   I2^{K3} = [4, 4],   I3^{K3} = [1, 1],

that are such that Ii^{K3} ⊆ Ii^{K2} ⊆ Ii^{K1} for all formulas. The resulting scores
using the choice ⪯_{All} on the different interpretations are summarised
in Table 1.
³ Note that we consider the new assessments to be consistent with the previous ones,
as I1i ⊆ I2i.

Table 1. Interval-valued scores from Example 1

      p  q  ag^{K1}       ag^{K2}       ag^{K3}
ω0    0  0  [2.5, 6.5]    [6.5, 6.5]    [6.5, 6.5]
ω1    0  1  [3.5, 11.5]   [7.5, 11.5]   [7.5, 7.5]
ω2    1  0  [0, 4]        [4, 4]        [4, 4]
ω3    1  1  [1, 5]        [1, 5]        [1, 1]
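The scores of Table 1 for K1 can be reproduced with a short Python sketch (ours; ag = Σ as in Example 1):

```python
# Formulas over (p, q): phi1 = p, phi2 = p AND q, phi3 = NOT q.
formulas = [lambda p, q: p, lambda p, q: p and q, lambda p, q: not q]
K1 = [(2.5, 2.5), (0, 4), (1, 5)]

def score(weights, p, q):
    # Interval score of an interpretation: sum (ag = sum, as in Example 1)
    # of the interval weights of the formulas it falsifies.
    fals = [w for f, w in zip(formulas, weights) if not f(p, q)]
    return (sum(a for a, b in fals), sum(b for a, b in fals))

table = {(p, q): score(K1, p, q) for p in (0, 1) for q in (0, 1)}
print(table[(0, 0)])   # (2.5, 6.5)  -- omega0, matching Table 1
print(table[(0, 1)])   # (3.5, 11.5) -- omega1
```

Replacing K1 by the tighter weight sets K2 or K3 reproduces the remaining columns in the same way.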

Figure 1 shows the different partial orders between the interpretations,
according to ⪯_{All,L}. We can see that ω2 and ω3 go from comparable to incom-
parable when going from ⪯^{K1}_{All,L} to ⪯^{K2}_{All,L}, and that the preference or ranking
between them is even reversed when going from ⪯^{K1}_{All,L} to ⪯^{K3}_{All,L}.

Fig. 1. Orderings ⪯_{All,L} of Example 1 on interpretations (the three Hasse diagrams,
for K1, K2 and K3, are omitted here).

It should be noted that what happens to ω2 , ω3 for ⪯_{All,L} is also true for
⪯_{Diff,L}. Indeed, FK (ω2 ) = {p ∧ q} and FK (ω3 ) = {¬q}, hence FK (ω2 \ ω3 ) =
FK (ω2 ) and FK (ω3 \ ω2 ) = FK (ω3 ). However, we can show that the two orderings
based on the lattice ordering do satisfy subset monotonicity.
Proposition 2. Given a knowledge base K, the two orderings ⪯^{K}_{All,L}, ⪯^{K}_{Diff,L}
satisfy subset monotonicity.

Proof. For ⪯_{Diff,L}, it is sufficient to notice that if FK (ω) ⊆ FK (ω′), then FK (ω \
ω′) = ∅. This means that ag({αi : φi ∈ FK (ω \ ω′)}) = [0, 0], hence we
necessarily have ω ⪯^{K}_{Diff,L} ω′.


For ⪯_{All,L}, the fact that FK (ω) ⊆ FK (ω′) means that the vectors a and a′
of lower values associated to {αi : φi ∈ FK (ω)} and {αi : φi ∈ FK (ω′)} will be
of the kind a′ = (a, a1 , . . . , am ), hence we will have ag(a) ≤ ag(a′). The same
reasoning applied to the vectors of upper bounds b and b′ means that we will
also have ag(b) ≤ ag(b′), meaning that ω ⪯^{K}_{All,L} ω′.

From this, we deduce that lattice orderings will tend to be too informative
for our purpose⁴, i.e., they will induce preferences between interpretations that
should be absent if we want to make only those inferences that are guaranteed
(i.e., hold whatever the value chosen within the intervals Ii ).

4.2 Strict Orderings


In this section, we will study strict orderings, and will show in particular that
while ⪯_{All,S} provides orderings that are not informative enough for our purpose,
⪯_{Diff,S} does satisfy our two properties.
As we did for lattice orderings, let us first focus on the notion of informational
monotonicity, and show that both orderings satisfy it.

Proposition 3. Given knowledge bases K1 , K2 with φ1i = φ2i and I1i ⊆ I2i for
all i ∈ {1, . . . , n}, the two orderings ⪯_{All,S}, ⪯_{Diff,S} satisfy informational
monotonicity.

Proof. Assume that [a, b] and [c, d] are the intervals obtained from K2 respec-
tively for ω and ω′ after aggregation has been performed, with b ≤ c, hence
ω ⪯^{K2}_{•,S} ω′ with • ∈ {All, Diff}.
Since ag is an increasing function, and as I1i ⊆ I2i, we will have that the
intervals [a′, b′] and [c′, d′] obtained from K1 for ω and ω′ after aggregation will
be such that [a′, b′] ⊆ [a, b] and [c′, d′] ⊆ [c, d], meaning that b′ ≤ b ≤ c ≤ c′,
hence ω ⪯^{K1}_{•,S} ω′, and this finishes the proof.
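A small numeric illustration of Proposition 3 (ours, using the interval scores of ω0 and ω1 from Table 1): refining the weight intervals only ever adds strict comparabilities, and never reverses them.

```python
def strictly_prec(i, j):
    # [a, b] strictly precedes [c, d] in the strict ordering iff b < c.
    return i[1] < j[0]

# Interval scores of omega0 and omega1 (Table 1), for K1 ⊇ K2 ⊇ K3.
w0 = {"K1": (2.5, 6.5), "K2": (6.5, 6.5), "K3": (6.5, 6.5)}
w1 = {"K1": (3.5, 11.5), "K2": (7.5, 11.5), "K3": (7.5, 7.5)}

# Under K1 the intervals overlap (incomparable); refining to K2 and K3
# makes omega0 strictly preferred, and the preference is never reversed.
assert not strictly_prec(w0["K1"], w1["K1"]) and not strictly_prec(w1["K1"], w0["K1"])
assert strictly_prec(w0["K2"], w1["K2"])
assert strictly_prec(w0["K3"], w1["K3"])
```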

Let us now look at the property of subset monotonicity. From the knowledge
base K1 in Example 1, one can immediately see that ⪯_{All,S} is not subset mono-
tonic, as FK (ω3 ) ⊆ FK (ω1 ) and FK (ω2 ) ⊆ FK (ω0 ) ⊆ FK (ω1 ), yet all intervals in
Table 1 overlap, meaning that all interpretations are incomparable. Hence ⪯_{All,S}
will usually not be as informative as we would like a cautious ranking procedure
to be. This is mainly due to the presence of redundant variables, or common
formulas, in the comparison of interpretations. In contrast, ⪯_{Diff,S} does not
suffer from the same defect, as the next proposition shows.

Proposition 4. Given a knowledge base K, the ordering ⪯_{Diff,S} satisfies subset
monotonicity.

Proof. As for Proposition 2, it is sufficient to notice that if FK (ω) ⊆ FK (ω′),
then FK (ω \ ω′) = ∅. This means that ag({αi : φi ∈ FK (ω \ ω′)}) = [0, 0],
hence we necessarily have ω ⪯^{K}_{Diff,S} ω′.

Hence, the ordering ⪯_{Diff,S} satisfies all the properties we have considered desir-
able in our framework. It does not add unwanted comparisons, while not losing
information that could be deduced without knowing the weights.

⁴ Which does not prevent them from being suitable for other purposes.

Example 2. If we consider the knowledge base K1 of Example 1, using ⪯_{Diff,S}
we could only deduce the rankings induced by the facts that FK (ω3 ) ⊆ FK (ω1 )
and FK (ω2 ) ⊆ FK (ω0 ) ⊆ FK (ω1 ), as ω3 does not falsify any of the formulas that
ω2 and ω0 falsify, hence we can directly compare their intervals. The resulting
ordering is pictured in Fig. 2.

Fig. 2. Ordering ⪯_{Diff,S} of Example 2 (Hasse diagram omitted: ω2 precedes ω0,
which precedes ω1; ω3 precedes ω1 and is incomparable with ω2 and ω0).
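The comparabilities of Fig. 2 can be checked with a short Python sketch (ours), applying the strict ordering to the aggregated Diff sets under ag = Σ:

```python
# Interval weights and falsified-formula sets for K1 of Example 1.
I = {"phi1": (2.5, 2.5), "phi2": (0, 4), "phi3": (1, 5)}
F = {"w0": {"phi1", "phi2"}, "w1": {"phi1", "phi2", "phi3"},
     "w2": {"phi2"}, "w3": {"phi3"}}

def diff_strict(a, b):
    # a precedes b under (Diff, S): the upper aggregated weight of the
    # formulas only a falsifies must not exceed the lower aggregated
    # weight of the formulas only b falsifies (ag = sum).
    up = sum(I[x][1] for x in F[a] - F[b])
    lo = sum(I[x][0] for x in F[b] - F[a])
    return up <= lo

# Subset relations yield the comparabilities of Fig. 2 ...
assert diff_strict("w2", "w0") and diff_strict("w0", "w1") and diff_strict("w3", "w1")
# ... while w3 stays incomparable with both w2 and w0.
assert not diff_strict("w3", "w2") and not diff_strict("w2", "w3")
assert not diff_strict("w3", "w0") and not diff_strict("w0", "w3")
```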

5 Conclusions
In this paper, we have looked at the problem of making cautious inferences in
weighted logics when weights are interval-valued, and have made first proposals
to make such inferences. There is of course a lot that remains to be done, such
as studying expressivity, representational or computational issues.
It should also be noted that our approach can easily be extended to cases
where weights are given by other uncertainty models. If Ii is an uncertain quan-
tity (modelled by a fuzzy set, a belief function, a probability, . . . ), we would then
need to specify how to propagate them to obtain ag(F ), and how to compare
these uncertain quantities.

References
1. Benferhat, S., Dubois, D., Prade, H.: Possibilistic and standard probabilistic seman-
tics of conditional knowledge bases. J. Logic Comput. 9(6), 873–895 (1999)
2. Benferhat, S., Hué, J., Lagrue, S., Rossit, J.: Interval-based possibilistic logic. In:
Twenty-Second International Joint Conference on Artificial Intelligence (2011)
3. Dubois, D., Godo, L., Prade, H.: Weighted logics for artificial intelligence - an intro-
ductory discussion. Int. J. Approx. Reason. 9(55), 1819–1829 (2014)
4. Dubois, D., Prade, H.: Possibilistic logic: a retrospective and prospective view. Fuzzy
Sets Syst. 144(1), 3–23 (2004)
5. Dupin De Saint-Cyr, F., Lang, J., Schiex, T.: Penalty logic and its link with
Dempster-Shafer theory. In: Uncertainty Proceedings 1994, pp. 204–211. Elsevier
(1994)

6. Gelain, M., Pini, M.S., Rossi, F., Venable, K.B., Wilson, N.: Interval-valued soft
constraint problems. Ann. Math. Artif. Intell. 58(3–4), 261–298 (2010)
7. Kaci, S., van der Torre, L.: Reasoning with various kinds of preferences: logic, non-
monotonicity, and algorithms. Ann. Oper. Res. 163(1), 89–114 (2008)
Preference Elicitation with Uncertainty:
Extending Regret Based Methods
with Belief Functions

Pierre-Louis Guillot and Sebastien Destercke(B)

Heudiasyc laboratory, 60200 Compiègne, France


[email protected]

Abstract. Preference elicitation is a key element of any multi-criteria decision analysis (MCDA) problem, and more generally of individual user preference learning. Existing efficient elicitation procedures in the literature mostly use either robust or Bayesian approaches. In this paper,
we are interested in extending the former by allowing the user to express uncertainty in addition to her preferential information, and by modelling it through belief functions. We show that in doing so, we preserve the strong guarantees of robust approaches, while overcoming some of their drawbacks. In particular, our approach allows the user to contradict herself, making it possible to detect inconsistencies or an ill-chosen model, something that is impossible with more classical robust methods.

Keywords: Belief functions · Preference elicitation · Multicriteria decision

1 Introduction

Preference elicitation, the process through which we collect preferences from a user, is an important step whenever we want to model her preferences. It is
a key element of domains such as multi-criteria decision analysis (MCDA) or
preference learning [7], where one wants to build a ranking model on multivariate
alternatives (characterised by criteria, features, . . . ). Our contribution is more
specific to MCDA, as it focuses on getting preferences from a single user, and
not a population of them.
Note that within this setting, preference modelling or learning can be asso-
ciated with various decision problems. Such problems most commonly include
the ranking problem that consists in ranking alternatives from best to worst,
the sorting problem that consists in classifying alternatives into ordered classes,
and finally the choice problem that consists in picking a single best candidate
among available alternatives. This article only deals with the choice problem but
can be extended towards the ranking problem in a quite straightforward manner
– commonly known as the iterative choice procedure – by considering a ranking
as a series of consecutive choices [1].
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 289–309, 2019.
https://doi.org/10.1007/978-3-030-35514-2_22

In order for the expert to make a recommendation in MCDA, she must first restrict her search to a set of plausible MCDA models. This is often done according to a priori assumptions on the decision making process, possibly constrained by computational considerations.
In this paper, we will assume that alternatives are characterised by q real values, i.e. are represented by a vector in R^q, and that preferences over them can be modelled by a value function f : R^q → R such that a ≻ b iff f(a) > f(b). More specifically, we will look at weighted averages. The example below illustrates this setting. Our results can in theory be extended straightforwardly to other evaluation functions (Choquet integrals, GAI, . . . ), but doing so would raise additional computational issues that would need to be solved.
Example 1 (choosing the best course). Consider a problem in which the DM is a student wanting to find the best possible course in a large set of courses, each of which has previously been assigned a grade from 0 to 10 – 0 being the least preferred and 10 the most preferred – according to 3 criteria: usefulness, pedagogy and interest. The expert makes the assumption that the DM evaluates each course according to a score computed as a weighted sum of its 3 grades. This is a strong assumption, as it means for example that an increase of 0.5 in usefulness will have the same impact on the score regardless of the grades in pedagogy and interest. In such a set of models, a particular model is equivalent to a set of weights in R3. Assume that the DM's preferences follow the model given by the weights (0.1, 0.8, 0.1), meaning that she considers pedagogy to be eight times as important as usefulness and interest, which are of equal importance. Given the grades reported in Table 1, she would prefer the Optimization course over the Machine learning course, as the former would have a 5.45 value and the latter a 3.2 value.

Table 1. Grades of courses

Course             usefulness   pedagogy   interest
Machine learning   8.5          1.5        10
Optimization       3            5.5        2
Linear algebra     7            5          5.5
Graph theory       1            2          6

Beyond the choice of a model, the expert also needs to collect or elicit preferences that are specific to the DM, and that she could not have guessed from a priori assumptions. Such information can be collected by asking the DM to answer questions in several forms, such as ranking a subset of alternatives from best to worst, or choosing a preferred candidate among a subset of alternatives.
Example 2 (choosing the best course (continued)). In our example, directly asking for weights would make little sense (as our model may be wrong, and as the user cannot be expected to be an expert of the chosen model). A way to get this information from her would therefore be to ask her to pick her favorite out of two courses. Let us assume that, when asked to choose between Optimization and Graph theory, she prefers the Optimization course, which is better than Graph theory in pedagogy and worse in interest. Her answer is thus compatible with weights (0.05, 0.9, 0.05) (strong preference for pedagogy over the other criteria) but not with (0.05, 0.05, 0.9) (strong preference for interest over the other criteria). Her answer has therefore given the expert additional information on the preferential model underlying her decision. We will see later that this generates a linear constraint over the possible weights.

Provided we have made some preference model assumptions (our case here), it is possible to look for efficient elicitation methods, in the sense that they solve the decision problem at hand in a small enough, if not minimal, number of questions. A lot of work has been directed specifically towards active elicitation methods, in which the set of questions to ask the DM is not given in advance but determined on the fly. In robust methods, this preferential information is assumed to be given with full certainty, which leads to at least two issues. The first one is that such elicitation methods do not account for the fact that the DM might doubt her own answers, and that these might not reflect her actual preferences. The second one, somehow implied by the first, is that most robust active elicitation methods will never put the DM in a position where she could contradict either herself or the assumptions made by the expert, as new questions will be built on the basis that previous answers are correct and hence should not be doubted. This is especially problematic when inaccurate preferences are given early on, or when the preference model is based on wrong assumptions.
This paper presents an extension of the Current Solution Strategy [3] that
includes uncertainty in the answers of the DM by using the framework based on
belief functions presented in [5]. Section 2 will present necessary preliminaries on
both robust preference elicitation based on regret and uncertainty management
based on belief functions. Section 3 will present our extension and some of the
associated theoretical results and guarantees. Finally Sect. 4 will present some
first numerical experiments that were made in order to test the method and its
properties in simulations.

2 Preliminaries
2.1 Formalization
Alternatives and Models: We will denote by 𝒳 the space of possible alternatives, and by X ⊆ 𝒳 the subset of available alternatives at the disposal of our DM, about which a recommendation needs to be made. In this paper we will consider alternatives summarised by q real values corresponding to criteria, hence 𝒳 ⊆ R^q. For any x ∈ 𝒳 and 1 ≤ i ≤ q, we denote by x^i ∈ R the evaluation of alternative x according to criterion i. We also assume that for any x, y ∈ 𝒳 such that x^i > y^i for some i ∈ {1, . . . , q} and x^l ≥ y^l for all l ∈ {1, . . . , q} \ {i}, x will always be strictly preferred to y – meaning that preferences respect ceteris paribus monotonicity – and we assume that the criteria utility scales are given.
X is a finite set of k alternatives such that X = {x1, x2, . . . , xk}, with xj the j-th alternative of X. Let P(X) be a preference relation over X, and let x, y ∈ X be two alternatives to compare. We will write x ≻P y if and only if x is strictly preferred to y in the corresponding relation, x ∼P y if and only if x and y are equally preferred, and x ≽P y if and only if x is either strictly preferred or equally preferred to y.

Preference Modelling and Weighted Sums: In this work, we focus on the case where the hypothesis set Ω of preference models is the set of weighted sum models¹. A singular model ω will be represented by its vector of weights in R^q, and ω will be used to denote indifferently the decision model and the corresponding weight vector. Ω can therefore be described as:

Ω = { ω ∈ R^q : ω^i ≥ 0 and ∑_{i=1}^q ω^i = 1 }.

Each model ω is associated with the corresponding aggregating evaluation function

f_ω(x) = ∑_{i=1}^q ω^i x^i,

and any two potential alternatives x, y in 𝒳 can then be compared by comparing their aggregated evaluations:

x ≽_ω y ⟺ f_ω(x) ≥ f_ω(y),    (1)

which means that if the model ω is known, P_ω(X) is a total preorder over X, the set of existing alternatives. Note that P_ω(X) can be determined using the pairwise relations ≽_ω. Weighted averages are a key model of preference learning whose linearity usually allows the development of efficient methods, especially in regret-based elicitation [2]. They are therefore an ideal starting point before exploring other, more complex functions, such as those that are linear in their parameters once alternatives are known (e.g., Choquet integrals, ordered weighted averages).

2.2 Robust Preference Elicitation

In theory, obtaining a unique true preference model requires both unlimited time
and unbounded cognitive abilities. This means that in practice, the best we can
do is to collect information identifying a subset Ω′ of possible models, and act

¹ In principle, our methods apply to any value function with the same properties, but may have to solve computational issues that depend on the specific chosen hypothesis.

accordingly. Rather than choosing a unique model within Ω′, robust methods usually look at the inferences that hold for every model in Ω′. Let Ω′ be the subset of models compatible with all the given preferential information; then we can define P_Ω′(X), a partial preorder of robust preferences over X, as follows:

x ≽_Ω′ y ⟺ ∀ω ∈ Ω′, f_ω(x) ≥ f_ω(y).    (2)

The research question we address here is to find elicitation strategies that reduce Ω′ as quickly as possible, obtaining at the limit an order P_Ω′(X) having only one maximal element². In practice, one may have to stop collecting information before that point, explaining the need for heuristic indicators of the fitness of competing alternatives as potential choices.

Regret Based Elicitation: Regret is a common way to assess the potential loss of recommending a given alternative under incomplete knowledge. It can help both in making a recommendation and in finding an efficient question. Regret methods use various indicators, such as the regret R_ω(x, y) of choosing x over y according to model ω, defined as

R_ω(x, y) = f_ω(y) − f_ω(x).    (3)

From this regret and a set Ω′ of possible models, we can then define the pairwise max regret as

PMR(x, y, Ω′) = max_{ω∈Ω′} R_ω(x, y) = max_{ω∈Ω′} (f_ω(y) − f_ω(x)),    (4)

which corresponds to the maximum possible regret of choosing x over y for any model in Ω′. The max regret for an alternative x, defined as

MR(x, Ω′) = max_{y∈X} PMR(x, y, Ω′) = max_{y∈X} max_{ω∈Ω′} (f_ω(y) − f_ω(x)),    (5)

then corresponds to the worst possible regret one can have when choosing x. Finally, the min max regret over a subset of models Ω′ is

mMR(Ω′) = min_{x∈X} MR(x, Ω′) = min_{x∈X} max_{y∈X} max_{ω∈Ω′} (f_ω(y) − f_ω(x)).    (6)

Picking as choice x* = arg min mMR(Ω′) is then a robust choice, in the sense that it is the one giving the minimal regret in a worst-case scenario (the one leading to max regret).

Example 3 (choosing the best course (continued)). Let 𝒳 = [0, 10]³ be the set of valid alternatives composed of 3 grades from 0 to 10 in, respectively, usefulness, pedagogy and interest. Let X = {x1, x2, x3, x4} be the set of available alternatives, in which x1 corresponds to the Machine learning course, x2 to the Optimization course, x3 to the Linear algebra course and x4 to the Graph theory course, as reported in Table 1. Let x, y ∈ X be two alternatives and Ω the set of weighted sum models; PMR(x, y, Ω) can be computed by optimizing max_{ω∈Ω} [ω^1(y^1 − x^1) + ω^2(y^2 − x^2) + ω^3(y^3 − x^3)]. As this linear function of ω is optimized over the convex polytope Ω, it can easily be solved exactly using linear programming (LP). The resulting values of PMR(x, y, Ω) and MR(x, Ω) are shown in Table 2. In this example, x1 is the alternative with minimum max regret, and the most conservative candidate answer to the choice problem according to regret.

² Or in some cases a maximal set {x1, . . . , xp} of equally preferred elements s.t. x1 ∼ . . . ∼ xp.

Table 2. Values of PMR(x, y, Ω) (left) and MR(x, Ω) (right)

x \ y   x1    x2    x3    x4          x    MR
x1      0     4     3.5   0.5         x1   4
x2      8     0     4     4           x2   8
x3      4.5   0.5   0     0.5         x3   4.5
x4      7.5   3.5   6     0           x4   7.5

Regret indicators are also helpful for making the elicitation strategy efficient and helping the expert ask relevant questions to the DM. Let Ω′ and Ω′′ be two sets of models such that mMR(Ω′) < mMR(Ω′′). In the worst case, we are certain that x*_{Ω′}, the optimal choice for Ω′, is less regretted than x*_{Ω′′}, the optimal choice for Ω′′, which means that we would rather have Ω′ as our set of models than Ω′′. Let I, I′ be two pieces of preferential information and Ω^I, Ω^{I′} the sets obtained by integrating this information. Finding which of the two is the most helpful statement in the progress towards a robust choice can therefore be done by comparing mMR(Ω^I) and mMR(Ω^{I′}). An optimal elicitation process (w.r.t. minimax regret) would then choose the question for which the worst possible answer gives the restriction on Ω that is the most helpful in providing a robust choice. However, computing such a question can be difficult, and the heuristic we present next aims at picking a nearly optimal question in an efficient and tractable way.

The Current Solution Strategy: Let us assume that Ω′ is the subset of decision models consistent with all the information available so far to the expert, and let us restrict ourselves to questions that consist in comparing pairs x, y of alternatives in X. The DM can only answer with I1 = x ≽ y or I2 = y ≽ x. A pair helpful in finding a robust solution as fast as possible can be computed as a solution to the following optimization problem, which consists in finding the pair minimizing the worst-case min max regret:

min_{(x,y)∈X²} WmMR({x, y}) = min_{(x,y)∈X²} max ( mMR(Ω′ ∩ Ω^{x≽y}), mMR(Ω′ ∩ Ω^{y≽x}) ).    (7)

The current solution strategy (referred to as CSS) is a heuristic answer to this problem that has proved to be efficient in practice [3]. It consists in asking the DM to compare x* ∈ arg mMR(Ω′), the least regretted alternative, to y* = arg max_{y∈X} PMR(x*, y, Ω′), the one to which it could be the most regretted (its "worst opponent"). CSS is efficient in the sense that it requires the computation of only one value of min max regret, instead of the O(q²) required to solve (7).

Example 4 (choosing the best course (continued)). Using the same example, according to Table 2, we have mMR(Ω) = MR(x1, Ω) = PMR(x1, x2, Ω), meaning that x1 is the least regretted alternative in the worst case and x2 the one it is most regretted against. The CSS heuristic consists in asking the DM to compare x1 and x2, respectively the Machine learning course and the Optimization course.
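The regret computations of Examples 3 and 4 can be reproduced in a few lines. The sketch below is ours (names and layout are hypothetical, not from the paper): since each regret indicator is linear in ω, its maximum over the simplex Ω is attained at one of the simplex vertices, so scanning the unit weight vectors replaces the LP call (an LP solver would be needed for general polytopes of models).

```python
# Regret computation for weighted-sum models (illustrative sketch).
# A linear function reaches its maximum over the simplex Omega at one of
# its vertices, so PMR reduces to a scan over the unit weight vectors.

def f(w, x):                                   # weighted-sum value, Eq. (1)
    return sum(wi * xi for wi, xi in zip(w, x))

def pmr(x, y, vertices):                       # Eq. (4)
    return max(f(w, y) - f(w, x) for w in vertices)

def mr(x, alternatives, vertices):             # Eq. (5)
    return max(pmr(x, y, vertices) for y in alternatives if y is not x)

def css_question(alternatives, vertices):
    """CSS: least-regretted x*, then its 'worst opponent' y*."""
    xs = min(alternatives, key=lambda x: mr(x, alternatives, vertices))
    ys = max((y for y in alternatives if y is not xs),
             key=lambda y: pmr(xs, y, vertices))
    return xs, ys

OMEGA = ((1, 0, 0), (0, 1, 0), (0, 0, 1))      # vertices of the weight simplex
X = {"ML": (8.5, 1.5, 10), "Opt": (3, 5.5, 2),
     "LinAlg": (7, 5, 5.5), "Graph": (1, 2, 6)}
alts = list(X.values())
# Reproduces Table 2 (MR = 4, 8, 4.5, 7.5) and Example 4's question:
xs, ys = css_question(alts, OMEGA)
print(xs == X["ML"], ys == X["Opt"])           # True True
```

The vertex scan is exact here only because Ω is the whole simplex; once answers cut Ω with linear constraints, the maxima must be taken over the vertices of the restricted polytope (or via an LP solver).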

2.3 Uncertain Preferential Information

Two key assumptions behind the methods we just described are that (1) the
initial chosen set Ω of models can perfectly describe the DM’s choices and (2)
the DM is an oracle, in the sense that any answer she provides truly reflects
her preferences, no matter how difficult the question. This certainly makes CSS
an efficient strategy, but also an unrealistic one. This means in particular that
if the DM makes a mistake, we will just pursue with this mistake all along the
process and will never question what was said before, possibly ending up with
sub-optimal recommendations.

Example 5 (choosing the best course (continued)). Let us assume, similarly to Example 2, that the expert has not yet gathered any preference from the DM, and that this time she asks her to compare alternatives x1 and x2 – respectively the Machine learning course and the Optimization course. Let us also assume, similarly to Example 1, that the DM makes decisions according to a weighted sum model with weights ω′ = (0.1, 0.8, 0.1). Since f_ω′(x2) = 5.45 > 3.2 = f_ω′(x1), she should prefer the Optimization course over the Machine learning course. However, for some reason – such as being unfocused or unsure about her preference – assume the DM's answer is inconsistent with ω′ and she states that x1 ≽ x2 rather than x2 ≽ x1.

Then Ω′, the set of models consistent with the available preferential information, is such that

Ω′ = Ω^{x1≽x2} = { ω ∈ Ω : ∑_{i=1}^3 ω^i (x1^i − x2^i) ≥ 0 } = { ω ∈ Ω : ω^2 ≤ 2/3 − (5/24) ω^1 },

as represented in Fig. 1. It is clear that ω′ ∉ Ω′: subsequent questions will only ever restrict Ω′, and the expert will never get quite close to modelling ω′. A similar point could be made if ω*, the model according to which the DM makes her decisions, does not even belong to Ω, the set of weighted sum models that the expert chose.
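The inconsistency of Example 5 can be checked numerically. In this hedged sketch (the function name is ours), an answer "x preferred to y" is consistent with a weight vector ω iff the dot product ω · (x − y) is nonnegative:

```python
# Consistency of an answer with a weighted-sum model (illustrative sketch).

def consistent_with(omega, preferred, other):
    """True iff model omega agrees with the answer 'preferred ≽ other'."""
    return sum(w * (p - o) for w, p, o in zip(omega, preferred, other)) >= 0

x1, x2 = (8.5, 1.5, 10), (3, 5.5, 2)      # Machine learning, Optimization
true_model = (0.1, 0.8, 0.1)              # the DM's actual weights

print(consistent_with(true_model, x1, x2))   # False: the answer excludes omega'
print(consistent_with(true_model, x2, x1))   # True: the opposite answer keeps it
```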

As we shall see, one way to adapt min-max regret approaches to circumvent


the two above difficulties can be to include a simple measure of how uncertain
an answer is.

Fig. 1. Graphical representation of Ω, Ω′ and ω′

The Belief Function Framework. In classical CSS, the response to a query by the DM always implies a set of consistent models Ω′ such that Ω′ ⊆ Ω. Here, we allow the DM to give, alongside her answer, a confidence level α ∈ [0, 1], interpreted as how confident she is that this particular answer matches her preferences. In the framework developed in [5], such information is represented by a mass function on Ω, referred to as m_α^{Ω′} and defined as:

m_α^{Ω′}(Ω) = 1 − α,    m_α^{Ω′}(Ω′) = α.

Such mass assignments are usually called simple support [13] and represent elementary pieces of uncertain information. A confidence level of 0 will correspond to vacuous knowledge about the true model ω*, and will in no way imply that the answer is wrong (as would have been the case in a purely probabilistic framework). A confidence level of 1 will correspond to the case of certainty, putting a hard constraint on the subset of models to consider.

Remark 1. Note that the values of α do not necessarily need to come from the DM, but can simply be chosen by the analyst (in the simplest case, as a constant) to weaken the assumptions of classical models. We will see in the experiments of Sect. 4 that such a strategy may indeed lead to interesting behaviours, without requiring the DM to provide confidence degrees if she finds the task too difficult, or if the analyst thinks such self-assessed confidence is meaningless.

Dempster's Rule. Pieces of information corresponding to each answer will be combined through the non-normalized Dempster's rule +∩. At step k, the mass function m_k capturing the current belief about the DM's decision model can thus be defined recursively as:

m_0 = m_1^Ω,    m_k = m_{k−1} +∩ m_{αk}^{Ωk}.    (8)

This rule, also known as the TBM conjunctive rule, is meant to combine distinct pieces of information. It is central to the Transferable Belief Model, which intends to justify belief functions without using probabilistic arguments [14].
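On a finite (e.g. discretized) frame, the recursion of Eq. (8) is a few lines of code. The sketch below is ours and purely illustrative: focal sets become frozensets, and the non-normalized rule sums products of masses over pairwise intersections.

```python
# Non-normalized Dempster combination on a small discretized frame (sketch).

def combine(m1, m2):
    out = {}
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B                     # conjunctive: intersect focal sets
            out[C] = out.get(C, 0.0) + a * b
    return out

def simple_support(frame, subset, alpha):
    """Mass of an answer restricting models to `subset` with confidence alpha."""
    return {frozenset(frame): 1 - alpha, frozenset(subset): alpha}

frame = {"w1", "w2", "w3", "w4"}          # toy discretization of Omega
m = {frozenset(frame): 1.0}               # m0: vacuous mass
m = combine(m, simple_support(frame, {"w1", "w2"}, 0.7))   # answer 1
m = combine(m, simple_support(frame, {"w2", "w3"}, 0.6))   # answer 2
print(round(m[frozenset({"w2"})], 10))    # 0.42 = 0.7 * 0.6, no conflict here
```

Since the two answers are compatible ({w1, w2} ∩ {w2, w3} ≠ ∅), no mass lands on the empty set; conflicting answers would make m(∅) > 0, as discussed next.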

Note that an information fusion setting and the interpretation of the TBM fit our problem particularly well, as they assume the existence of a unique true model ω* underlying the DM's decision process, which might or might not be in our predefined set of models Ω. Allowing for an open world is a key feature of the framework. Let us nevertheless recall that the non-normalized Dempster's rule +∩ can also be justified without resorting to the TBM [8,9,11].
In our case, this independence of sources associated with two mass assignments m_{αi}^{Ωi} and m_{αj}^{Ωj} means that, even though both pieces of preferential information account for the preferences of the same DM, the answer the DM gives to the i-th question does not directly impact the answer she gives to the j-th question: she would have answered the same thing had her i-th answer been different for some reason. This seems reasonable, as we do not expect the DM to have a clear intuition about the consequences of her answers over the set of models, nor even to be aware that such a set – or the axioms underlying it – exists. One must however be careful not to ask the exact same question twice in a short time range.
Since combined masses are all possibility distributions, an alternative to
assuming independence would be to assume complete dependence, simply using
the minimum rule [6] which among other consequences would imply a loss of
expressivity3 but a gain in computation4 .
As said before, one of the key interests of using this rule (rather than its normalised version) is to allow m(∅) > 0, notably to detect either mistakes in the DM's answers (considered as an unreliable source) or a bad choice of model (under an open world assumption). Determining where the conflict mainly comes from and acting upon it will be the topic of future works. Note that in the specific case of simple support functions, we have the following result:

Proposition 1. If the m_{αk}^{Ωk} are simple support functions combined through Dempster's rule, then

m(∅) = 0 ⟺ ∃ω, Pl({ω}) = 1,

with Pl({ω}) = ∑_{E⊆Ω, ω∈E} m(E) the plausibility measure of model ω.

Proof (Sketch). The ⇐ part is obvious given the properties of the plausibility measure. The ⇒ part follows from the fact that if m(∅) = 0, then all focal elements are supersets of ∩_{i∈{1,...,k}} Ωi, hence all contain at least one common element.

This in particular shows that m(∅) can, in this specific case, be used as an
estimate of the logical consistency of the provided information pieces.
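Proposition 1 can be checked on the same kind of finite toy frame (again a sketch with hypothetical names, ours): with two conflicting simple supports, conflict mass appears on ∅ and no single model keeps plausibility 1.

```python
# Conflict and plausibility for simple supports on a toy frame (sketch).

def combine(m1, m2):
    out = {}
    for A, a in m1.items():
        for B, b in m2.items():
            out[A & B] = out.get(A & B, 0.0) + a * b
    return out

def pl_singleton(m, w):                   # Pl({w}): sum of masses containing w
    return sum(v for F, v in m.items() if w in F)

frame = frozenset({"w1", "w2", "w3"})
m1 = {frame: 0.3, frozenset({"w1"}): 0.7}   # answer 1, confidence 0.7
m2 = {frame: 0.2, frozenset({"w2"}): 0.8}   # incompatible answer, confidence 0.8
m = combine(m1, m2)
print(round(m.get(frozenset(), 0.0), 10))   # 0.56 = 0.7 * 0.8: conflict mass
print(max(pl_singleton(m, w) for w in frame) < 1)   # True, as Proposition 1 states
```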

Consistency with Robust, Set-Based Methods: when a piece of information I is given with full certainty (α = 1), we retrieve a so-called categorical mass

³ For instance, no new values of confidence would be created when using a finite set {α1, . . . , αM} for elicitation.
⁴ The number of focal sets increases only linearly with the number of information pieces.


m_k(Ω^I) = 1. Combining a set I1, ..., Ik of such certain pieces of information will end up in the combined mass

m_k( ∩_{i∈{1,...,k}} Ω_i ) = 1,

which is simply the intersection of all provided constraints, and may turn out either empty or non-empty, meaning that inconsistency will be a Boolean notion, i.e.,

m_k(∅) = 1 if ∩_{i∈{1,...,k}} Ω_i = ∅, and m_k(∅) = 0 otherwise.

Recall that in the usual CSS or minimax regret strategies, such a situation can
never happen.

3 Extending CSS Within the Belief Function Framework

We now present our proposed extension of the Current Solution Strategy, integrating confidence degrees and uncertain answers. Note that in the first two Sects. 3.1 and 3.2, we assume that the mass on the empty set is null, in order to parallel our approach with the usual one not including uncertainty. We will then consider the problem of conflict in Sect. 3.3.

3.1 Extending Regret Notions

Extending PMR: when uncertainty over possible models is defined through a mass function m : 2^Ω → [0, 1], the subsets of Ω known as focal sets are associated with a value m(Ω′) corresponding to the knowledge that ω belongs to Ω′ and nothing more. The extension we propose averages the value of PMR over focal sets, weighted by their corresponding mass:

EPMR(x, y, m) = ∑_{Ω′⊆Ω} m(Ω′) · PMR(x, y, Ω′),    (9)

and we can easily see that in the case of certain answers (α = 1), we do have

EPMR(x, y, m_k) = PMR( x, y, ∩_{i∈{1,...,k}} Ω_i ),    (10)

hence formally extending Eq. (4). When interpreting m(Ω′) as the probability that ω belongs to Ω′, EPMR can be seen as an expectation of PMR when randomly picking a set in 2^Ω.

Extending EMR: Similarly, we propose a weighted extension of maximum regret:

EMR(x, m) = ∑_{Ω′⊆Ω} m(Ω′) · MR(x, Ω′) = ∑_{Ω′⊆Ω} m(Ω′) · max_{y∈X} PMR(x, y, Ω′).    (11)

EMR is the expectation of the maximal pairwise max regret, taken each time between x and its worst adversary y ∈ X – as opposed to a maximum over each y ∈ X of the expected pairwise max regret between x and the given y, described by MER(x, m) = max_y EPMR(x, y, m). Both approaches would be equivalent to MR in the certain case, meaning that if all αi = 1 then

EMR(x, m_k) = MER(x, m_k) = MR( x, ∩_{i∈{1,...,k}} Ω_i ).    (12)

However, EMR seems to be a better option to assess the max regret of an alternative: under the assumption that the true model ω* is within the focal set Ω′, it makes more sense to compare x to its worst opponent within Ω′, which may well be different for two different focal sets. Indeed, if the true model ω* does in fact belong to Ω′, decision x is only as bad as how big the regret can get for any adversarial counterpart y_{Ω′} ∈ X.
Extending mMR: we propose to extend it as

mEMR(m) = min_{x∈X} EMR(x, m).    (13)

mEMR minimizes over x ∈ X the expectation of max regret, and is different from the expectation of the minimal max regret over whichever alternative x is optimal, described by EmMR(m) = ∑_{Ω′} m(Ω′) · min_{x∈X} MR(x, Ω′). Again, these two options with certain answers boil down to mMR, as we have

mEMR(m) = EmMR(m) = mMR( ∩_{i∈{1,...,k}} Ω_i ).    (14)

The problem with EmMR is that it would allow for multiple possible best alternatives, leaving us with an unclear answer as to what the best choice option is, arg min EmMR not being defined. It indicates how robust, in the sense of regret, we expect any best answer to the choice problem to be, assuming there can be an optimal alternative for each focal set. In contrast, mEMR minimizes the max regret while restricting the optimal alternative x to be the same in all of them, hence providing a unique argument and allowing our recommendation system and elicitation strategy to give an optimal recommendation.

Extending CSS: our Evidential Current Solution Strategy (ECSS) then amounts, at step k with mass function m_k, to performing the following sequence of operations:

– Find x* = arg mEMR(m_k) = arg min_{x∈X} EMR(x, m_k);
– Find y* = arg max_{y∈X} EPMR(x*, y, m_k);
– Ask the DM to compare x* and y* and provide αk, obtaining m_{αk}^{Ωk};
– Compute m_{k+1} := m_k +∩ m_{αk}^{Ωk};
– Repeat until the conflict is too high (red flag), the budget of questions is exhausted, or mEMR(m_k) is sufficiently low.

Finally, recommend x* = arg mEMR(m_k). Thanks to Eqs. (10), (12) and (14), it is easy to see that we retrieve CSS as the special case in which all answers are completely certain.
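The evidential indicators of this loop can be sketched on Example 6 (a hypothetical implementation, ours). Each focal set is represented by the tuple of its extreme points, since linear regret is maximized at a vertex; the vertices of Ω1 below are derived by hand from its constraint ω² ≥ 2/3 − (5/24)ω¹, whereas in general they would come from an LP or vertex enumeration.

```python
# Evidential regret indicators EPMR/EMR (Eqs. 9 and 11) — illustrative sketch.

def f(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pmr(x, y, vertices):                       # PMR on one focal set
    return max(f(w, y) - f(w, x) for w in vertices)

def epmr(x, y, mass):                          # Eq. (9)
    return sum(m * pmr(x, y, v) for v, m in mass.items())

def emr(x, alternatives, mass):                # Eq. (11)
    return sum(m * max(pmr(x, y, v) for y in alternatives if y is not x)
               for v, m in mass.items())

X = {"ML": (8.5, 1.5, 10), "Opt": (3, 5.5, 2),
     "LinAlg": (7, 5, 5.5), "Graph": (1, 2, 6)}
alts = list(X.values())
OMEGA = ((1, 0, 0), (0, 1, 0), (0, 0, 1))      # simplex vertices
# Vertices of Omega_1 = {w in Omega : w2 >= 2/3 - (5/24) w1}:
OMEGA1 = ((0, 2/3, 1/3), (0, 1, 0), (8/19, 11/19, 0))
mass = {OMEGA: 0.3, OMEGA1: 0.7}               # m1 of Example 6

best = min(alts, key=lambda x: emr(x, alts, mass))
worst = max((y for y in alts if y is not best),
            key=lambda y: epmr(best, y, mass))
print(best == X["LinAlg"], worst == X["ML"])   # True True: ECSS asks x3 vs x1
```

This reproduces the outcome of Example 6 below: x3 minimizes EMR, but its worst EPMR opponent is x1, whereas CSS restricted to Ω1 would oppose x3 to x2.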

Example 6. Starting with the initial mass function m_0 such that m_0(Ω) = 1, the choice of CSS coincides with the choice of ECSS (all the evidence we have is committed to ω ∈ Ω). With the values of PMR reported in Table 2, the alternatives the DM is asked to compare are x1, the least regretted alternative, and x2, its most regretted counterpart. In accordance with her true preference model ω* = (0.1, 0.8, 0.1), the DM states that x2 ≽ x1, i.e., she prefers the Optimization course over the Machine learning course, with confidence degree α = 0.7. Let Ω1 be the set of WS models in which x2 can be preferred to x1, which in symmetry with Example 5 can be defined as Ω1 = {ω ∈ Ω : ω^2 ≥ 2/3 − (5/24) ω^1}, as represented in Fig. 2.

Fig. 2. Graphical representation of Ω, Ω1 and ω*

Available information on her decision model after step 1 is represented by the mass function m_1, with

m_1(Ω) = 0.3,    m_1(Ω_1) = 0.7.

The values of PMR and MR on Ω_1 can be computed using LP, as reported in Table 3; the values of PMR and MR on Ω have previously been computed, as reported in Table 2. The values of EPMR and EMR can then be deduced by combining them according to Eqs. (9) and (11), as reported in Table 4. In this example, x3 minimizes both MR and EMR, and our extension agrees with the robust version as to which recommendation is to be made. However, the most regretted counterpart to which the DM has to compare x3 in the next step differs, as ECSS would require that she compare x3 and x1, rather than x3 and x2 for CSS.

Table 3. Values of PMR(x, y, Ω1) (left) and MR(x, Ω1) (right)

x \ y   x1              x2    x3              x4                x    MR
x1      0               4     3.5             0.5               x1   4
x2      0               0     53/38 ≈ 1.39    −1                x2   53/38 ≈ 1.39
x3      −5/6 ≈ −0.83    0.5   0               −11/6 ≈ −1.83     x3   0.5
x4      109/38 ≈ 2.87   3.5   81/19 ≈ 4.26    0                 x4   81/19 ≈ 4.26

Table 4. Values of EPMR(x, y, m1) (left) and EMR(x, m1) (right)

x \ y   x1               x2    x3               x4                x    EMR
x1      0                4     3.5              0.5               x1   4
x2      2.4              0     827/380 ≈ 2.18   0.5               x2   1283/380 ≈ 3.38
x3      23/30 ≈ 0.77     0.5   0                −68/60 ≈ −1.13    x3   1.7
x4      809/190 ≈ 4.26   3.5   909/190 ≈ 4.78   0                 x4   ≈ 5.23

3.2 Preserving the Properties of CSS


This section discusses to what extent ECSS is consistent with three key properties of CSS:

1. CSS is monotonic, in the sense that the min max regret mMR decreases at each iteration.
2. CSS provides strong guarantees, in the sense that the felt regret of the recommendation is ensured to be at most as bad as the computed mMR.
3. CSS produces questions that are non-conflicting (whatever the answer) with previous answers.

We would like to keep the first two properties, at least in the absence of conflicting information, as they ensure respectively that the method will converge and that it will provide robust recommendations. However, we would like our strategy to raise questions possibly contradicting some previous answers, so as to raise the previously mentioned red flags in case of problems (unreliable DM or bad choice of model assumption). As the next proposition shows, our method also converges.

Proposition 2. Let m_{k−1} and m_{αk}^{Ωk} be two mass functions on Ω issued from ECSS such that m_k(∅) = (m_{k−1} +∩ m_{αk}^{Ωk})(∅) = 0; then

1. EPMR(x, y, m_k) ≤ EPMR(x, y, m_{k−1});
2. EMR(x, m_k) ≤ EMR(x, m_{k−1});
3. mEMR(m_k) ≤ mEMR(m_{k−1}).

Proof (sketch). The first two items are simply due to the combined facts that, on one hand, we know [15] that applying +∩ makes m_k a specialisation of m_{k−1}, and, on the other hand, that for any Ω′′ ⊆ Ω′ we have f(x, y, Ω′′) ≤ f(x, y, Ω′) for any f ∈ {PMR, MR}. The third item is implied by the second, as it consists in taking a minimum over a set of values of EMR that are all smaller.

Note that the above argument applies to any combination rule producing a specialisation of the two combined masses, including the possibilistic minimum rule [6], Denoeux's family of w-based rules [4], etc. We can also show that the evidential approach, if we provide it with questions computed through CSS, is actually more cautious than CSS:
Proposition 3. Consider the subsets of models Ω1, . . . , Ωk issued from the answers of the CSS strategy, and some values α1, . . . , αk provided a posteriori by the DM. Let m^(k−1) and m^(Ωk)_(αk) be two mass functions issued from ECSS on Ω such that m^k(∅) = 0. Then we have:

1. EPMR(x, y, m^k) ≥ PMR(x, y, ∩_{i∈{1,...,k}} Ωi)
2. EMR(x, m^k) ≥ MR(x, ∩_{i∈{1,...,k}} Ωi)
3. mEMR(m^k) ≥ mMR(∩_{i∈{1,...,k}} Ωi)

Proof (sketch). The first two items are due to the combined facts that, on one hand, all focal elements are supersets of ∩_{i∈{1,...,k}} Ωi and, on the other hand, that for any Ω′ ⊆ Ω″ we have f(x, y, Ω′) ≤ f(x, y, Ω″) for any f ∈ {PMR, MR}. Any value of EPMR or EMR is a weighted average over terms all greater than their robust counterpart on ∩_{i∈{1,...,k}} Ωi, and is therefore greater itself. The third item is implied by the second, as the biggest value of EMR is thus necessarily bigger than all the values of MR.
This simply shows that, if anything, our method is even more cautious than
CSS. It is in that sense probably slightly too cautious in an idealized scenario –
especially as unlike robust indicators our evidential extensions will never reach
0 – but provides guarantees that are at least as strong.
While we find the first two properties appealing, one goal of including uncertainties in the DM answers is to relax the third property, whose underlying
assumptions (perfectness of the DM and of the chosen model) are quite strong.
In Sects. 3.3 and 4, we show that ECSS indeed satisfies this requirement, respec-
tively on an example and in experiments.
Preference Elicitation with Uncertainty: Extending Regret Based Methods 303

3.3 Evidential CSS and Conflict

The following example simply demonstrates that, in practice, ECSS can lead to questions that possibly conflict with each other, a feature CSS does not have. This conflict is only a possibility: no conflict will appear should the DM provide answers completely consistent with the set of models and with what she previously stated, and in that case at least one model will remain fully plausible5 (see Proposition 1).

Example 7 (Choosing the best course (continued)). At step 2 of our example the DM is asked to compare x1 to x3 in accordance with Table 4. Even though it conflicts with ω∗, the model underlying her decision, the DM has the option to state that x1 ≻ x3 with confidence degree α > 0, putting weight on Ω2, the set of consistent models defined by Ω2 = {ω ∈ Ω : Σ_{i=1}^{q} ω^i (x1^i − x3^i) ≥ 0} = {ω ∈ Ω : ω^2 ≤ 16/9 − 3/8 ω^1}. However, as represented in Fig. 3, Ω1 ∩ Ω2 = ∅.

[Figure: two panels plotting the regions Ω, Ω1 and Ω2 in the (ω1, ω2) unit square]

Fig. 3. Graphical representation of Ω, Ω1, Ω2 and ω∗

This means that x1 ≻ x2 and x1 ≻ x3 are not compatible preferences assuming the DM acts according to a weighted sum model. This can be either because she actually does not follow such a model, or because one of her answers did not reflect her actual preference (which would be the case here). Assuming she does state that x1 ≻ x3 with confidence α = 0.6, information about the preference model at step 2 is captured through the mass function m2 defined as:

m2 : Ω → 0.12, Ω1 → 0.28, Ω2 → 0.18, ∅ → 0.42,

meaning that ECSS detects a degree of inconsistency equal to m2(∅) = 0.42.

This illustrative example, having only four alternatives, is of course not representative of real situations, but it clearly shows that within ECSS conflict may appear as an effect of the strategy. In other words, we do not have to modify the strategy to detect consistency issues: it will automatically seek out such cases whenever αi < 1. In our opinion, this is clearly a desirable property, showing that we depart from the assumptions of CSS.

5 This contrasts with a Bayesian/probabilistic approach, where no model would receive full support in non-degenerate cases.
However, the inclusion of conflict as a focal element raises new issues, the first one being how to extend the various indicators of regret to this situation. In other words, how can we compute PMR(x, y, ∅), MR(x, ∅) or mMR(∅)? This question does not have a single answer and requires careful thought; however, the most straightforward extension is to propose a way to compute PMR(x, y, ∅), and then to plug in the different estimates of ECSS. Two possibilities immediately come to mind:
– PMR(x, y, ∅) = max_{ω∈Ω} Rω(x, y), which takes the highest regret among all models, and would therefore be equivalent to treating conflict as ignorance. This amounts to considering Yager's rule [16] in the setting of belief functions. This rule would make the regret increase when conflict appears, therefore providing alerts, but this clearly means that the monotonicity of Proposition 2 would no longer hold. Yet, one could discuss whether such a property is desirable or not in case of conflicting opinions. It would also mean that elicitation methods are likely to try to avoid conflict, as it will induce a regret increase.
– PMR(x, y, ∅) = minω∈Ω max(0, Rω (x, y)), considering ∅ as the limit inter-
section of smaller and smaller sets consistent with the DM answers. Such a
choice would allow us to recover monotonicity, and would clearly privilege
conflict as a good source of regret reduction.
Such distinctions extend to other indicators, but we leave such a discussion for future work.
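As a hedged illustration (our own sketch, not code from the paper), the two candidate extensions of PMR(x, y, ∅) can be contrasted over a finite sample of models, where each model ω is summarised by its regret value Rω(x, y):

```python
def pmr_on_conflict(regrets, mode="ignorance"):
    """Two candidate definitions of PMR(x, y, emptyset), given
    regrets = [R_w(x, y) for w in a finite sample of Omega].

    mode="ignorance": conflict treated as ignorance (Yager-style),
                      i.e. the worst regret over all models.
    mode="limit":     emptyset seen as the limit intersection of ever
                      smaller consistent sets, i.e. the best achievable
                      non-negative regret.
    """
    if mode == "ignorance":
        return max(regrets)
    return min(max(0.0, r) for r in regrets)

sample = [-0.2, 0.1, 0.5]
print(pmr_on_conflict(sample, "ignorance"))  # 0.5
print(pmr_on_conflict(sample, "limit"))      # 0.0
```

The example makes the trade-off concrete: the first extension inflates regret under conflict (an alert), while the second can drive it to 0, recovering monotonicity.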

3.4 On Computational Tractability


ECSS requires, in principle, computing PMR values for every possible focal element, which could lead to an exponential explosion of the computational burden. We can however show that in the case of weighted sums, and more generally of linear constraints, where PMR has to be solved through an LP, we can improve upon this worst-case bound. We introduce two simplifications that lead to more efficient methods providing exact answers:

Using the Polynomial Number of Elementary Subsets. The computational cost can be reduced by using the fact that if Pj = {Ω_{i1}, . . . , Ω_{ik}} is a partition of Ωj, then:

PMR(x, y, Ωj) = max_{l∈{i1,...,ik}} PMR(x, y, Ωl).

Hence, computing PMR on the partition is sufficient to retrieve the global PMR through a simple max. Let us now show that, in our case, the size of this partition only increases polynomially. Let Ω1, . . . , Ωn be the sets of models consistent with, respectively, the first to the nth answer, and Ω1ᶜ, . . . , Ωnᶜ their respective complements in Ω.
Due to the nature of the conjunctive rule +∩, every focal set Ω′ of m^k = m^(Ω1)_(α1) +∩ . . . +∩ m^(Ωn)_(αn) is the union of elements of the partition P_(Ω′) = {Ω̃1, . . . , Ω̃s}, with:

Ω̃k = Ω ∩ (∩_{i∈Uk} Ωi) ∩ (∩_{i∈{1,...,n}\Uk} Ωiᶜ),   Uk ⊆ {1, . . . , n}.

This means that each Ω′'s PMR can be computed using the PMRs of its corresponding partition. This still does not help much, as there is a total of 2ⁿ possible sets Ω̃k. Yet, in the case of convex domains cut by linear constraints, which holds for the weighted sum, the following theorem shows that the total number of elementary subsets in Ω only increases polynomially.

Theorem 1 [12, p. 39]. Let E be a convex bounded subset of F, a Euclidean space of dimension q, and H = {η1, . . . , ηn} a set of n hyperplanes in F such that ∀i ∈ {1, . . . , n}, ηi separates F into two subsets F0^(ηi) and F1^(ηi). To each of the 2ⁿ possible U ⊆ {1, . . . , n} a subset F_U^H = F ∩ (∩_{i∈U} F1^(ηi)) ∩ (∩_{i∈{1,...,n}\U} F0^(ηi)) can be associated. Let Θ_H = {U ⊆ {1, . . . , n} : F_U^H ∩ E ≠ ∅} and B_H = |Θ_H|; then

B_H ≤ Λ_q^n = 1 + n + C(n, 2) + · · · + C(n, q),

where C(n, j) denotes the binomial coefficient, meaning that at most Λ_q^n of the F_U^H subsets have a non-empty intersection with E.

In the above theorem (the proof of which can be found in [12], or in [10] for the specific case of E ⊂ R³), the subsets F_U^H play the role of the Ω̃k, whose number only grows according to a polynomial whose degree increases with q.
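As a quick sanity check (our own sketch, not from the paper), the bound Λ_q^n is a simple binomial sum, and contrasting it with the naive 2ⁿ count of subsets U shows the gain:

```python
from math import comb

def region_bound(n, q):
    """Lambda_q^n = 1 + n + C(n,2) + ... + C(n,q): the maximum number of
    non-empty cells that n cutting hyperplanes induce in a convex body of
    dimension q (Theorem 1)."""
    return sum(comb(n, j) for j in range(q + 1))

# classic case: 3 lines cut the plane into at most 7 regions
print(region_bound(3, 2))           # 7
# polynomial vs. exponential growth for n = 20 questions, q = 3
print(region_bound(20, 3), 2**20)   # 1351 1048576
```

For a fixed model dimension q the bound grows as O(n^q), which is what makes the partition-based computation tractable.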

Using the Polynomial Number of Extreme Points in the Simplex Problem. Since we work with LPs, we also know that optimal values will be obtained at extreme points. Optimization on focal sets can therefore all be done by examining points at the intersection of q hyperplanes. This set of extreme points is

E = {ω = η_{i1} ∩ · · · ∩ η_{iq} : {i1, . . . , iq} ⊆ {1, . . . , n}},   (15)

with ηi the hyperplanes corresponding to the questions. We have |E| = C(n, q) ∈ O(n^q), which is reasonable whenever q is small enough (typically the case in MCDA). The computation of the coordinates of the extreme points related to the constraints of each subset can be done in advance for each ω ∈ E, and not once per subset and pair of alternatives, since EΩ′, the set of extreme points of Ω′, will always be such that EΩ′ ⊆ E. The computation of the dot products necessary

to compute Rω(x, y) for all ω ∈ E and x, y ∈ X can also be done once for each ω ∈ E, and need not be repeated in each subset Ω′ such that ω ∈ EΩ′. These results indicate that when q (the model-space dimension) is reasonably low and questions correspond to cutting hyperplanes over a convex set, ECSS can be performed efficiently. This will be the case for several models such as OWA or k-additive Choquet integrals with low k, but not for others such as full Choquet integrals, whose dimension for k criteria is 2^k − 2. In these cases, it seems inevitable that one would resort to approximations having a low numerical impact (e.g., merging or forgetting focal elements having a very low mass value).
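The extreme-point enumeration of Eq. (15) can be sketched as follows (our own illustration; the function names and the `inside` feasibility test are ours, standing in for membership of Ω, and a real implementation would take Ω's constraints as input):

```python
from itertools import combinations
import numpy as np

def extreme_points(A, b, q, inside):
    """Candidate extreme points of Eq. (15): intersections of q of the n
    question hyperplanes a_i . w = b_i (rows of A, entries of b), filtered
    by the domain test `inside`."""
    pts = []
    for idx in combinations(range(len(A)), q):
        M, rhs = A[list(idx)], b[list(idx)]
        if abs(np.linalg.det(M)) < 1e-12:   # degenerate choice of hyperplanes
            continue
        w = np.linalg.solve(M, rhs)
        if inside(w):
            # round to merge near-duplicates arising from numerical error
            pts.append(tuple(float(round(x, 12)) for x in w))
    return sorted(set(pts))

# three question hyperplanes in a q = 2 dimensional model space
A = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 0.0]])
b = np.array([1.0, 0.0, 0.2])
E = extreme_points(A, b, 2, inside=lambda w: all(0 <= x <= 1 for x in w))
print(E)  # [(0.2, 0.2), (0.2, 0.8), (0.5, 0.5)]
```

The dot products Rω(x, y) = ω · (y − x) can then be precomputed once per point of E and reused across all focal subsets, as described above.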

4 Experiments
To test our strategy and its properties, we conducted simulated experiments in which the confidence degree was always constant. Such experiments therefore also show what would happen if we did not ask the DM for confidence degrees, but nevertheless assumed that she could make mistakes, with a very simple noise model.
The first experiment, reported in Fig. 4, measures the extra cautiousness of EMR when compared to MR. To do so, simulations were run for several fixed degrees of confidence – including 1, in which case EMR coincides with MR – in which a virtual DM states her preferences with the given degree of confidence, and the value of EMR at each step is divided by the initial value so as to observe its evolution. These EMR ratios were then averaged over 100 simulations for each degree. Results show that while high confidence degrees have a limited impact, low confidence degrees (< 0.7) may greatly slow down the convergence.

[Plot: mEMR divided by initial mMR vs. number of questions, for α ∈ {0, 0.3, 0.5, 0.7, 0.8, 0.9, 1}]

Fig. 4. Average evolution of minimax regret with various degrees of confidence

The second experiment, reported in Fig. 5, aims at determining whether ECSS and CSS truly generate different questioning strategies. To do so, we monitored the two strategies for a given confidence degree, and identified the first step k at which the two questions differ. These values were averaged over 300 simulations for several confidence degrees. Results show that even for a high confidence degree (α = 0.9) it takes on average only three questions to see a difference. This shows that the methods are truly different in practice.

[Plot: average number of questions before the first differing question vs. confidence degree]

Fig. 5. Average position of the first different question in the elicitation process, per degree of confidence

[Plot: degree of inconsistency vs. number of questions, for α ∈ {0.3, 0.5, 0.7, 0.9}, for both the RAND and WS decision makers]

Fig. 6. Evolution of average inconsistency with a DM fitting the WS model and a randomly choosing DM

The third experiment, reported in Fig. 6, is meant to observe how good our measure of inconsistency m(∅) is in practice as an indicator that something is wrong with the answers given by a DM. In order to do so, simulations were made in which one of two virtual DMs answers with a fixed confidence degree, and the value of m(∅) is recorded at each step. These values were then averaged over 100 simulations for each confidence degree. The two virtual DMs behaved, respectively, completely randomly (RAND), and in accordance with a fixed weighted sum model (WS) with probability α and randomly with probability 1 − α. So the first one is highly inconsistent with our model assumption, while the second is consistent with this assumption but makes mistakes.
Results are quite encouraging: the inconsistency of the random DM with the model assumption is quickly identified, especially for high confidence degrees. For the DM that follows our model assumptions but makes mistakes, the results are similar, except that the conflict increase is not especially higher for lower confidence degrees. This can easily be explained: in the case of low confidence degrees, we have more mistakes but those are assigned a lower weight, while in the case of high confidence degrees the occasional mistake is quite impactful, as it has a high weight.

5 Conclusion

In this paper, we have proposed an evidential extension of the CSS strategy used in robust elicitation of preferences. We have studied its properties, notably comparing them to those of CSS, and have performed first experiments to demonstrate the utility of including confidence degrees in robust preference elicitation. These experiments confirm the interest of our proposal, in the sense that it quickly identifies inconsistencies between the DM's answers and the model assumptions. It remains to check whether, in the presence of mistakes from the DM, the real regret (and not the computed one) obtained for ECSS is better than the one obtained for CSS.
As future work, we would like to work on the next step, i.e., identifying the sources of inconsistency (whether they come from a bad model assumption or an unreliable DM) and proposing correction strategies. We would also like to perform more experiments, and extend our approach to other decision models (Choquet integrals and OWA operators being the first candidates).

References
1. Benabbou, N., Gonzales, C., Perny, P., Viappiani, P.: Incremental elicitation of Choquet capacities for multicriteria choice, ranking and sorting problems. Artif. Intell. 246, 152–180 (2017)
2. Benabbou, N., Gonzales, C., Perny, P., Viappiani, P.: Minimax regret approaches
for preference elicitation with rank-dependent aggregators. EURO J. Decis. Pro-
cesses 3(1–2), 29–64 (2015)
3. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based opti-
mization and utility elicitation using the minimax decision criterion. Artif. Intell.
170(8–9), 686–713 (2006)
4. Denœux, T.: Conjunctive and disjunctive combination of belief functions induced
by nondistinct bodies of evidence. Artif. Intell. 172(2–3), 234–264 (2008)

5. Destercke, S.: A generic framework to include belief functions in preference han-


dling and multi-criteria decision. Int. J. Approximate Reasoning 98, 62–77 (2018)
6. Destercke, S., Dubois, D.: Idempotent conjunctive combination of belief functions:
extending the minimum rule of possibility theory. Inf. Sci. 181(18), 3925–3945
(2011)
7. Fürnkranz, J., Hüllermeier, E.: Preference Learning. Springer, Heidelberg (2010).
https://doi.org/10.1007/978-3-642-14125-6
8. Klawonn, F., Schweke, E.: On the axiomatic justification of Dempster’s rule of
combination. Int. J. Intell. Syst. 7(5), 469–478 (1992)
9. Klawonn, F., Smets, P.: The dynamic of belief in the transferable belief model and specialization-generalization matrices. In: Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence (1992)
10. Orlik, P., Terao, H.: Arrangements of hyperplanes, vol. 300. Springer Science &
Business Media, Heidelberg (2013)
11. Pichon, F., Denoeux, T.: The unnormalized Dempster’s rule of combination: a new
justification from the least commitment principle and some extensions. J. Autom.
Reasoning 45(1), 61–87 (2010)
12. Schläfli, L., Wild, H.: Theorie der vielfachen Kontinuität, vol. 38. Springer, Basel
(1901). https://doi.org/10.1007/978-3-0348-5118-3
13. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
14. Smets, P.: The combination of evidence in the transferable belief model. IEEE
Trans. Pattern Anal. Mach. Intell. 12(5), 447–458 (1990)
15. Smets, P.: The application of the matrix calculus to belief functions. Int. J.
Approxim. Reasoning 31(1–2), 1–30 (2002)
16. Yager, R.R.: On the Dempster-Shafer framework and new combination rules. Inf. Sci. 41(2), 93–137 (1987)
Evidence Propagation and Consensus
Formation in Noisy Environments

Michael Crosscombe1(B) , Jonathan Lawry1 , and Palina Bartashevich2


1 University of Bristol, Bristol, UK
{m.crosscombe,j.lawry}@bristol.ac.uk
2 Otto von Guericke University Magdeburg, Magdeburg, Germany
[email protected]

Abstract. We study the effectiveness of consensus formation in multi-


agent systems where there is both belief updating based on direct evi-
dence and also belief combination between agents. In particular, we con-
sider the scenario in which a population of agents collaborate on the
best-of-n problem where the aim is to reach a consensus about which is
the best (alternatively, true) state from amongst a set of states, each with
a different quality value (or level of evidence). Agents’ beliefs are repre-
sented within Dempster-Shafer theory by mass functions and we inves-
tigate the macro-level properties of four well-known belief combination
operators for this multi-agent consensus formation problem: Dempster’s
rule, Yager’s rule, Dubois & Prade’s operator and the averaging oper-
ator. The convergence properties of the operators are considered and
simulation experiments are conducted for different evidence rates and
noise levels. Results show that a combination of updating on direct evi-
dence and belief combination between agents results in better consensus
to the best state than does evidence updating alone. We also find that in
this framework the operators are robust to noise. Broadly, Yager’s rule
is shown to be the better operator under various parameter values, i.e.
convergence to the best state, robustness to noise, and scalability.

Keywords: Evidence propagation · Consensus formation ·


Dempster-Shafer theory · Distributed decision making · Multi-agent
systems

1 Introduction

Agents operating in noisy and complex environments receive evidence from a


variety of different sources, many of which will be at least partially inconsistent.
In this paper we investigate the interaction between two broad categories of
evidence: (i) direct evidence from the environment, and (ii) evidence received
from other agents with whom an agent is interacting or collaborating to perform
a task. For example, robots engaged in a search and rescue mission will receive
data directly from sensors as well as information from other robots in the team.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 310–323, 2019.
https://doi.org/10.1007/978-3-030-35514-2_23

Alternatively, software agents can have access to online data as well as sharing
data with other agents.
The efficacy of combining these two types of evidence in multi-agent systems
has been studied from a number of different perspectives. In social epistemology, it has been argued [6] that agent-to-agent communication has an important role to play in propagating locally held information widely across a population.
example, interaction between scientists facilitates the sharing of experimental
evidence. Simulation results are then presented which show that a combination
of direct evidence and agent interaction, within the Hegselmann-Krause opinion
dynamics model [10], results in faster convergence to the true state than updat-
ing based solely on direct evidence. A probabilistic model combining Bayesian
updating and probability pooling of beliefs in an agent-based system has been
proposed in [13]. In this context it is shown that combining updating and pooling
leads to faster convergence and better consensus than Bayesian updating alone.
An alternative methodology exploits three-valued logic to combine both types
of evidence [2] and has been effectively applied to distributed decision-making
in swarm robotics [3].
In this current study we exploit the capacity of Dempster-Shafer theory
(DST) to fuse conflicting evidence in order to investigate how direct evidence
can be combined with a process of iterative belief aggregation in the context
of the best-of-n problem. The latter refers to a general class of problems in
distributed decision-making [16,22] in which a population of agents must col-
lectively identify which of n alternatives is the correct, or best, choice. These
alternatives could correspond to physical locations as, for example, in a search
and rescue scenario, different possible states of the world, or different decision-
making or control strategies. Agents receive direct but limited feedback in the
form of quality values associated with each choice, which then influence their
beliefs when combined with those of other agents with whom they interact. It is
not our intention to develop new operators in DST nor to study the axiomatic
properties of particular operators at the local level (see [7] for an overview of
such properties). Instead, our motivation is to study the macro-level conver-
gence properties of several established operators when applied iteratively by a
population of agents, over long timescales, and in conjunction with a process of
evidential updating, i.e., updating beliefs based on evidence.
An outline of the remainder of the paper is as follows. In Sect. 2 we give a
brief introduction to the relevant concepts from DST and summarise its previous
application to dynamic belief revision in agent-based systems. Section 3 intro-
duces a version of the best-of-n problem exploiting DST measures and combi-
nation operators. In Sect. 4 we then give the fixed point analysis of a dynamical
system employing DST operators so as to provide insight into the convergence
properties of such systems. In Sect. 5 we present the results from a number of
agent-based simulation experiments carried out to investigate consensus forma-
tion in the best-of-n problem under varying rates of evidence and levels of noise.
Finally, Sect. 6 concludes with some discussion.

2 An Overview of Dempster-Shafer Theory


In this section we introduce relevant concepts from Dempster-Shafer theory
(DST) [5,19], including four well-known belief combination operators.
Definition 1. Mass function (or agent's belief)
Given a set of states or frame of discernment S = {s1, ..., sn}, let 2^S denote the power set of S. An agent's belief is then defined by a basic probability assignment or mass function m : 2^S → [0, 1], where m(∅) = 0 and Σ_{A⊆S} m(A) = 1. The mass function then characterises a belief and a plausibility measure defined on 2^S such that for A ⊆ S:

Bel(A) = Σ_{B⊆A} m(B)   and   Pl(A) = Σ_{B : B∩A≠∅} m(B),

and hence Pl(A) = 1 − Bel(Aᶜ).
A number of operators have been proposed in DST for combining or fusing
mass functions [20]. In this paper we will compare in a dynamic multi-agent set-
ting the following operators: Dempster’s rule of combination (DR) [19], Dubois
& Prade’s operator (D&P) [8], Yager’s rule (YR) [25], and a simple averaging
operator (AVG). The first three operators all make the assumption of inde-
pendence between the sources of the evidence to be combined but then employ
different techniques for dealing with the resulting inconsistency. DR uniformly
reallocates the mass associated with non-intersecting pairs of sets to the overlap-
ping pairs, D&P does not re-normalise in such cases but instead takes the union
of the two sets, while YR reallocates all inconsistent mass values to the universal
set S. These four operators were chosen based on several factors: the operators
are well established and have been well studied, they require no additional infor-
mation about individual agents, and they are computationally efficient at scale
(within the limits of DST).
Definition 2. Combination operators
Let m1 and m2 be mass functions on 2^S. Then the combined mass function m1 ⊕ m2 is a function m1 ⊕ m2 : 2^S → [0, 1] such that for ∅ ≠ A, B, C ⊆ S:

(DR)   m1 ⊕ m2 (C) = (1/(1 − K)) Σ_{A∩B=C≠∅} m1(A) · m2(B),

(D&P)  m1 ⊕ m2 (C) = Σ_{A∩B=C≠∅} m1(A) · m2(B) + Σ_{A∩B=∅, A∪B=C} m1(A) · m2(B),

(YR)   m1 ⊕ m2 (C) = Σ_{A∩B=C≠∅} m1(A) · m2(B) if C ≠ S, and
       m1 ⊕ m2 (S) = m1(S) · m2(S) + K,

(AVG)  m1 ⊕ m2 (C) = (1/2)(m1(C) + m2(C)),

where K is the mass associated with conflict, i.e., K = Σ_{A∩B=∅} m1(A) · m2(B).
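The four operators of Definition 2 can be sketched in code as follows (a minimal illustration of ours, representing a mass function as a dict from frozensets of states to masses, and inferring S from the focal elements, which assumes the universal set can be recovered from the inputs):

```python
from itertools import product

def combine(m1, m2, rule="DR"):
    """Combine two mass functions m1, m2 : dict[frozenset, float]."""
    if rule == "AVG":                       # simple averaging operator
        return {C: 0.5 * (m1.get(C, 0.0) + m2.get(C, 0.0))
                for C in set(m1) | set(m2)}
    S = frozenset().union(*m1, *m2)         # universal set (assumption)
    out, K = {}, 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            out[C] = out.get(C, 0.0) + a * b
        elif rule == "D&P":                 # conflict goes to the union
            out[A | B] = out.get(A | B, 0.0) + a * b
        else:
            K += a * b                      # conflicting mass
    if rule == "DR":                        # renormalise by 1 - K
        if K == 1.0:
            raise ValueError("total conflict: Dempster's rule undefined")
        out = {C: v / (1.0 - K) for C, v in out.items()}
    elif rule == "YR":                      # conflict reallocated to S
        out[S] = out.get(S, 0.0) + K
    return out

m1 = {frozenset({"s1"}): 0.6, frozenset({"s1", "s2"}): 0.4}
m2 = {frozenset({"s2"}): 0.5, frozenset({"s1", "s2"}): 0.5}
print(combine(m1, m2, "YR"))
```

With these inputs the conflicting mass is K = 0.6 × 0.5 = 0.3, which DR spreads by renormalisation, D&P sends to {s1, s2}, and YR sends to S.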

In the agent-based model of the best-of-n problem, proposed in Sect. 3, agents


are required to make a choice as to which of n possible states they should inves-
tigate at any particular time. To this end we utilise the notion of pignistic dis-
tribution proposed by Smets and Kennes [21].

Definition 3. Pignistic distribution
For a given mass function m, the corresponding pignistic distribution on S is a probability distribution obtained by reallocating the mass associated with each set A ⊆ S uniformly to the elements of that set, i.e., si ∈ A, as follows:

P(si | m) = Σ_{A : si ∈ A} m(A)/|A|.
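A minimal sketch of the pignistic reallocation (our own code, using the same dict-based representation of a mass function):

```python
def pignistic(m, S):
    """Pignistic distribution of Definition 3: each focal set A spreads
    its mass m(A) uniformly over its |A| elements."""
    return {s: sum(v / len(A) for A, v in m.items() if s in A) for s in S}

m = {frozenset({"s1"}): 0.5,
     frozenset({"s1", "s2"}): 0.3,
     frozenset({"s1", "s2", "s3"}): 0.2}
P = pignistic(m, {"s1", "s2", "s3"})
# P("s1") = 0.5 + 0.3/2 + 0.2/3, P("s2") = 0.3/2 + 0.2/3, P("s3") = 0.2/3
```

By construction the result is a probability distribution: the reallocated shares of each focal element sum back to its mass.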

DST has been applied to multi-agent dynamic belief revision in a number of


ways. For example, [4] and [24] investigate belief revision where agents update
their beliefs by taking a weighted combination of conditional belief values of
other agents using Fagin-Halpern conditional belief measures. These measures
are motivated by the probabilistic interpretation of DST according to which a
belief and plausibility measure are characterised by a set of probability distribu-
tions on S. Several studies [1,2,15] have applied a three-valued version of DST
in multi-agent simulations. This corresponds to the case in which there are two
states, S = {s1 , s2 }, one of which is associated with the truth value true (e.g.,
s1 ), one with false (s2 ), and where the set {s1 , s2 } is then taken as corresponding
to a third truth state representing uncertain or borderline. One such approach
based on subjective logic [1] employs the combination operator proposed in [11].
Another [15] uses Dempster’s rule applied to combine an agent’s beliefs with
an aggregate of those of her neighbours. Similarly, [2] uses Dubois & Prade’s
operator for evidence propagation. Other relevant studies include [12] in which
Dempster’s rule is applied across a network of sparsely connected agents.
With the exception of [2], and only for two states, none of the above studies
considers the interaction between direct evidential updating and belief combina-
tion. The main contribution of this paper is therefore to provide a detailed and
general study of DST applied to dynamic multi-agent systems in which there
is both direct evidence from the environment and belief combination between
agents with partially conflicting beliefs. In particular, we will investigate and
compare the consensus formation properties of the four combination operators
(Definition 2) when applied to the best-of-n problem.

3 The Best-of-n Problem within DST

Here we present a formulation of the best-of-n problem within the DST frame-
work. We take the n choices to be the states S. Each state si ∈ S is assumed to
have an associated quality value qi ∈ [0, 1] with 0 and 1 corresponding to min-
imal and maximal quality, respectively. Alternatively, we might interpret qi as

quantifying the level of available evidence that si corresponds to the true state
of the world.
In the best-of-n problem agents explore their environment and interact with
each other with the aim of identifying which is the highest quality (or true)
state. Agents sample states and receive evidence in the form of the quality qi ,
so that in the current context evidence Ei regarding state si takes the form of the following mass function:

mEi = {si} : qi , S : 1 − qi .
Hence, qi is taken as quantifying both the evidence directly in favour of si provided by Ei, and also the evidence directly against any other state sj for j ≠ i. Given evidence Ei an agent updates its belief by combining its current mass function m with mEi using a combination operator so as to obtain the new mass function given by m ⊕ mEi.
A summary of the process by which an agent might obtain direct evidence in this model is then as follows. Based on its current mass function m, an agent stochastically selects a state si ∈ S to investigate1, according to the pignistic probability distribution for m as given in Definition 3. More specifically, it will update m to m ⊕ mEi with probability P(si |m) × r for i = 1, . . . , n and leave its belief unchanged with probability (1 − r), where r ∈ [0, 1] is a fixed evidence rate quantifying the probability of finding evidence about the state that it is currently investigating. In addition, we also allow for the possibility of noise in the evidential updating process. This is modelled by a random variable ε ∼ N(0, σ²) associated with each quality value. In other words, in the presence of noise the evidence Ei received by an agent has the form:

mEi = {si} : qi + ε, S : 1 − qi − ε,

where if qi + ε < 0 then it is set to 0, and if qi + ε > 1 then it is set to 1. Overall, the process of updating from direct evidence is governed by the two parameters, r and σ, quantifying the availability of evidence and the level of associated noise, respectively.
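One evidential-updating step under these parameters might be sketched as follows (our own illustration, not the authors' code; it arbitrarily fixes Yager's rule for the combination with the evidence mass function, whereas the paper compares four operators):

```python
import random

def evidence_step(m, qualities, r=0.5, sigma=0.1, rng=random):
    """One step of evidential updating: with probability r, pick a state s_i
    via the pignistic distribution, draw noisy quality q = q_i + eps clipped
    to [0, 1], and combine m with m_Ei = {s_i}: q, S: 1 - q."""
    S = frozenset(qualities)
    if rng.random() >= r:                        # no evidence found
        return m
    states = sorted(S)
    pig = [sum(v / len(A) for A, v in m.items() if s in A) for s in states]
    si = rng.choices(states, weights=pig)[0]     # roulette-wheel selection
    q = min(1.0, max(0.0, qualities[si] + rng.gauss(0.0, sigma)))
    out, K, single = {}, 0.0, frozenset({si})
    for A, v in m.items():
        if si in A:                              # A inter {s_i} = {s_i}
            out[single] = out.get(single, 0.0) + v * q
        else:                                    # A inter {s_i} is empty
            K += v * q
        out[A] = out.get(A, 0.0) + v * (1.0 - q)  # A inter S = A
    out[S] = out.get(S, 0.0) + K                 # Yager: conflict to S
    return {A: v for A, v in out.items() if v > 0.0}
```

For example, starting from the vacuous belief m = {S : 1.0}, a single step with σ = 0 moves mass q_i onto the singleton of whichever state the pignistic roulette wheel selects.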
In addition to receiving direct evidence we also include belief combination
between agents in this model. This is conducted in a pairwise symmetric manner
in which two agents are selected at random to combine their beliefs, with both
agents then adopting this combination as their new belief, i.e., if the two agents have beliefs m1 and m2, respectively, then they both replace these with m1 ⊕ m2. However, in the case that agents are combining their beliefs under Dempster's rule and that their beliefs are completely inconsistent, i.e., when K = 1 (see Definition 2), they do not form a consensus and the process moves on to the next iteration.
In summary, during each iteration both processes of evidential updating and
consensus formation take place2 . However, while every agent in the population
1
We utilise roulette wheel selection; a proportionate selection process.
2
Due to the possibility of rounding errors occurring as a result of the multiplication
of small numbers close to 0, we renormalise the mass function that results from each
process.

has the potential to update its own belief, provided that it successfully receives a
piece of evidence, the consensus formation is restricted to a single pair of agents
for each iteration. That is, we assume that only two agents in the whole popu-
lation are able to communicate and combine their beliefs during each iteration.

4 Fixed Point Analysis


In the following, we provide insight into the convergence properties of the dynamical system described in Sect. 3. Consider an agent model in which at each time step t two agents are selected at random to combine their beliefs from a population of k agents A = {a1, . . . , ak} with beliefs quantified by mass functions m_i^t : i = 1, . . . , k. For any t the state of the system can be represented by a vector of mass functions ⟨m_1^t, . . . , m_k^t⟩. Without loss of generality, we can assume that the updated state is then ⟨m_1^(t+1), m_2^(t+1), . . . , m_k^(t+1)⟩ = ⟨m_1^t ⊕ m_2^t, m_1^t ⊕ m_2^t, m_3^t, . . . , m_k^t⟩. Hence, we have a dynamical system characterised by the following mapping:

⟨m_1^t, . . . , m_k^t⟩ → ⟨m_1^t ⊕ m_2^t, m_1^t ⊕ m_2^t, m_3^t, . . . , m_k^t⟩.

The fixed points of this mapping are those for which m_1^t = m_1^t ⊕ m_2^t and m_2^t = m_1^t ⊕ m_2^t. This requires that m_1^t = m_2^t, and hence the fixed points of the mapping are the fixed points of the operator, i.e., those mass functions m for which m ⊕ m = m.
Let us analyse in detail the fixed points for the case in which there are 3 states,
$S = \{s_1, s_2, s_3\}$. Let $m = \{s_1, s_2, s_3\}: x_7,\ \{s_1, s_2\}: x_4,\ \{s_1, s_3\}: x_5,\ \{s_2, s_3\}: x_6,\ \{s_1\}: x_1,\ \{s_2\}: x_2,\ \{s_3\}: x_3$ represent a general mass function defined on
this state space, where without loss of generality we take $x_7 = 1 - x_1 - x_2 -
x_3 - x_4 - x_5 - x_6$. For Dubois & Prade's operator the constraint that $m \odot m = m$
generates the following simultaneous equations:

$$x_1^2 + 2x_1x_4 + 2x_1x_5 + 2x_1x_7 + 2x_4x_5 = x_1$$
$$x_2^2 + 2x_2x_4 + 2x_2x_6 + 2x_2x_7 + 2x_4x_6 = x_2$$
$$x_3^2 + 2x_3x_5 + 2x_3x_6 + 2x_3x_7 + 2x_5x_6 = x_3$$
$$x_4^2 + 2x_1x_2 + 2x_4x_7 = x_4$$
$$x_5^2 + 2x_1x_3 + 2x_5x_7 = x_5$$
$$x_6^2 + 2x_2x_3 + 2x_6x_7 = x_6$$
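These equations can be checked numerically. The sketch below (our own Python, using a dict-of-frozensets encoding of mass functions) enumerates Dubois & Prade's operator directly, reallocating the mass of each conflicting pair (A ∩ B = ∅) to the union A ∪ B, and confirms that the components of m ⊙ m reproduce the polynomial expressions above for an arbitrary mass function.

```python
import random
from itertools import product

def dp_combine(m1, m2):
    # Dubois & Prade: mass on a conflicting pair (A & B empty) moves to A | B
    out = {}
    for (A, wA), (B, wB) in product(m1.items(), m2.items()):
        C = A & B if A & B else A | B
        out[C] = out.get(C, 0.0) + wA * wB
    return out

# A general mass function x1..x7 over S = {s1, s2, s3}, with x7 the remainder.
x1, x2, x3, x4, x5, x6 = (random.uniform(0.0, 0.15) for _ in range(6))
x7 = 1.0 - (x1 + x2 + x3 + x4 + x5 + x6)
m = {frozenset({'s1'}): x1, frozenset({'s2'}): x2, frozenset({'s3'}): x3,
     frozenset({'s1', 's2'}): x4, frozenset({'s1', 's3'}): x5,
     frozenset({'s2', 's3'}): x6, frozenset({'s1', 's2', 's3'}): x7}

mm = dp_combine(m, m)
# The {s1} and {s1, s2} components match the first and fourth equations.
assert abs(mm[frozenset({'s1'})]
           - (x1**2 + 2*x1*x4 + 2*x1*x5 + 2*x1*x7 + 2*x4*x5)) < 1e-12
assert abs(mm[frozenset({'s1', 's2'})]
           - (x4**2 + 2*x1*x2 + 2*x4*x7)) < 1e-12
assert abs(sum(mm.values()) - 1.0) < 1e-12  # the operator preserves total mass
```

Because conflicting mass is moved to unions rather than normalised away, no renormalisation constant appears and the total mass remains exactly 1.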

The Jacobian for this set of equations is given by:

$$J = \left[ \frac{\partial}{\partial x_j}\, m \odot m(A_i) \right]_{ij},$$

where $A_1 = \{s_1\}$, $A_2 = \{s_2\}$, $A_3 = \{s_3\}$, $A_4 = \{s_1, s_2\}$, and so on. The stable fixed
points are those solutions to the above equations for which the eigenvalues of
the Jacobian evaluated at the fixed point lie within the unit circle on the complex
plane. In this case the only stable fixed points are the mass functions $\{s_1\}: 1$,
$\{s_2\}: 1$ and $\{s_3\}: 1$.

316 M. Crosscombe et al.

In other words, the only stable fixed points are those for
which agents' beliefs are both certain and precise, that is, where for some state
$s_i \in S$, $Bel(\{s_i\}) = Pl(\{s_i\}) = 1$. The stable fixed points for Dempster's rule
and Yager's rule are also of this form. The averaging operator is idempotent, and
all mass functions are unstable fixed points.
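The stability claim can also be illustrated dynamically, without computing eigenvalues: under repeated self-combination with Dubois & Prade's operator, a small perturbation of the fixed point {s1}: 1 decays back towards it. The following is our own Python sketch (the dict-of-frozensets representation is an assumption, not the paper's code).

```python
from itertools import product

def dp_combine(m1, m2):
    # Dubois & Prade's operator over dict-of-frozensets mass functions
    out = {}
    for (A, wA), (B, wB) in product(m1.items(), m2.items()):
        C = A & B if A & B else A | B
        out[C] = out.get(C, 0.0) + wA * wB
    return out

s1, s2, s3 = (frozenset({s}) for s in ('s1', 's2', 's3'))

# {s1}: 1 is a fixed point of the operator: m ⊙ m = m exactly.
assert dp_combine({s1: 1.0}, {s1: 1.0}) == {s1: 1.0}

# Stability: a perturbed belief is driven back towards {s1}: 1.
m = {s1: 0.9, s2: 0.05, s3: 0.05}
for _ in range(30):
    m = dp_combine(m, m)
assert m[s1] > 0.999
```

A perturbation in the opposite direction, e.g. a near-uniform mass function, instead drifts towards whichever singleton initially dominates, consistent with the three stable fixed points identified above.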
The above analysis concerns agent-based systems applying a combination operator
in order to reach consensus. However, we have yet to incorporate evidential
updating into this model. As outlined in Sect. 3, it is proposed that each agent
investigates a particular state $s_i$ chosen according to its current beliefs using
the pignistic distribution. With probability $r$ this will result in an update to its
beliefs from $m$ to $m \odot m_{E_i}$. Hence, for convergence it is also required that agents
only choose to investigate states for which $m \odot m_{E_i} = m$. Assuming $q_i > 0$,
there is only one such fixed point, corresponding to $m = \{s_i\}: 1$. Hence,
the consensus driven by belief combination as characterised by the above fixed
point analysis will also result in convergence of individual agent beliefs when we
incorporate evidential updating. That is, an agent with beliefs close to a fixed
point of the operator, i.e., $m = \{s_i\}: 1$, will choose to investigate state $s_i$ with
very high probability and will therefore tend to remain close to a fixed point of the
evidential updating process.
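For concreteness, the pignistic distribution (the transferable belief model's BetP [21]) and the roulette-wheel state selection mentioned in the footnote of Sect. 3 can be sketched as follows; the encoding and function names are our own illustrative choices.

```python
import random

def pignistic(m):
    """BetP(s) = sum over A containing s of m(A) / |A|."""
    bet = {}
    for A, w in m.items():
        for s in A:
            bet[s] = bet.get(s, 0.0) + w / len(A)
    return bet

def choose_state(m, rng=random):
    # roulette-wheel (proportionate) selection over the pignistic distribution
    bet = pignistic(m)
    states, weights = zip(*sorted(bet.items()))
    return rng.choices(states, weights=weights)[0]

m = {frozenset({'s1', 's2'}): 0.6, frozenset({'s3'}): 0.4}
assert pignistic(m) == {'s1': 0.3, 's2': 0.3, 's3': 0.4}
```

Note how an agent with belief close to {s_i}: 1 yields BetP(s_i) close to 1, so it almost always investigates s_i, matching the fixed-point argument above.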

5 Simulation Experiments
In this section we describe experiments conducted to understand the behaviour
of the four belief combination operators in the context of the dynamic multi-
agent best-of-n problem introduced in Sect. 3. We compare their performance
under different evidence rates r and noise levels σ, as well as their scalability to
different numbers of states n.

5.1 Parameter Settings


Unless otherwise stated, all experiments share the following parameter values.
We consider a population $A$ of $k = 100$ agents with beliefs initialised so that

$$m_i^0 = S : 1 \quad \text{for } i = 1, \ldots, 100.$$

In other words, at the beginning of each simulation every agent is in a state
of complete ignorance, as represented in DST by allocating all mass to the set
of all states $S$. Each experiment is run for a maximum of 5,000 iterations, or
until the population converges. Here, convergence requires that the beliefs of the
population have not changed for 100 interactions, where an interaction may be
the updating of beliefs based on evidence or the combination of beliefs between
agents. For a given set of parameter values the simulation is run 100 times and
results are then averaged across these runs.
Quality values are defined so that $q_i = \frac{i}{n+1}$ for $i = 1, \ldots, n$, and consequently
$s_n$ is the best state. In the following, $Bel(\{s_n\})$ provides a measure of convergence
performance for the considered operators.
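The quality-value setting can be written down directly, as below. Note that the additive zero-mean Gaussian noise model is our reading of how σ enters the evidence (the evidence model itself is defined in Sect. 3, outside this excerpt), so `noisy_quality` should be read as an assumption.

```python
import random

n, sigma = 3, 0.1

# Quality values q_i = i / (n + 1), so s_n is the unique best state.
q = [i / (n + 1) for i in range(1, n + 1)]
assert q == [0.25, 0.5, 0.75]          # the values used in Sect. 5.2
assert max(q) == q[n - 1]              # s_n is the best state

def noisy_quality(i, rng=random):
    # assumed evidence model: q_i corrupted by zero-mean Gaussian noise
    # with standard deviation sigma
    return q[i - 1] + rng.gauss(0.0, sigma)
```

With n = 3 this reproduces the quality values q1 = 0.25, q2 = 0.5 and q3 = 0.75 used in the convergence experiments below.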

Fig. 1. Average Bel({s3 }) plotted against iteration t with r = 0.05 and σ = 0.1.
Comparison of all four operators with error bars displaying the standard deviation.

5.2 Convergence Results


Initially we consider the best-of-n problem where n = 3 with quality values
q1 = 0.25, q2 = 0.5 and q3 = 0.75. Figure 1 shows belief values for the best state
s3 averaged across agents and simulation runs for the evidence rate r = 0.05
and noise standard deviation σ = 0.1. For both Dubois & Prade’s operator and
Yager’s rule there is complete convergence to Bel({s3 }) = 1 while for Dempster’s
rule the average value of Bel({s3 }) at steady state is approximately 0.9. The
averaging operator does not converge to a steady state and instead maintains
an average value of Bel({s3 }) oscillating around 0.4. For all but the averaging
operator, at steady state the average belief and plausibility values are equal. This
is consistent with the fixed point analysis given for Dubois & Prade’s operator
in Sect. 4, showing that all agents converge to mass functions of the form m =
{si } : 1 for some state si ∈ S. Indeed, for both Dubois & Prade’s operator
and Yager’s rule all agents converge to m = {s3 } : 1, while for Dempster’s rule
this happens in the large majority of cases. In other words, the combination of
updating from direct evidence and belief combination results in agents reaching
the certain and precise belief that s3 is the true state of the world.

5.3 Varying Evidence Rates


In this section we investigate how the rate at which agents receive information
from their environment affects their ability to reach a consensus about the true
state of the world.
Figures 2a and b show average steady state values of Bel({s3 }) for evidence
rates in the lower range r ∈ [0, 0.01] and across the whole range r ∈ [0, 1], respec-
tively. For each operator we compare the combination of evidential updating and
belief combination (solid lines) with that of evidential updating alone (dashed
lines). From Fig. 2a we see that for low values of r ≤ 0.02 Dempster’s rule con-
verges to higher average values of Bel({s3 }) than do the other operators. Indeed,
for 0.001 ≤ r ≤ 0.006 the average value of Bel({s3 }) obtained using Dempster’s
rule is approximately 10% higher than is obtained using Dubois & Prade’s oper-
ator and Yager’s rule, and is significantly higher still than that of the averaging
operator. However, the performance of Dempster’s rule declines significantly for

(a) Low evidence rates r ∈ [0, 0.01]. (b) All evidence rates r ∈ [0, 1].

Fig. 2. Average Bel({s3 }) for evidence rates r ∈ [0, 1]. Comparison of all four operators
both with and without belief combination between agents.

Fig. 3. Standard deviation for different evidence rates r ∈ [0, 0.5]. Comparison of all
four operators both with and without belief combination between agents.

higher evidence rates and for r > 0.3 it converges to average values for Bel({s3 })
of less than 0.8. At r = 1, when every agent is receiving evidence at each time
step, there is failure to reach consensus when applying Dempster’s rule. Indeed,
there is polarisation with the population splitting into separate groups, each cer-
tain that a different state is the best. In contrast, both Dubois & Prade’s operator
and Yager’s rule perform well for higher evidence rates and for all r > 0.02 there
is convergence to an average value of Bel({s3}) = 1. Meanwhile, the averaging
operator appears largely insensitive to increasing evidence rates, instead
maintaining similar levels of performance for all r > 0.1. For all subsequent figures
showing steady state results, we do not include error bars as this impacts nega-
tively on readability. Instead, we show the standard deviation plotted separately
against the evidence rate in Fig. 3. As expected, standard deviation is high for
low evidence rates in which the sparsity of evidence results in different runs of
the simulation converging to different states. This then declines rapidly with
increasing evidence rates.
The dashed lines in Figs. 2a and b show the values of Bel({s3 }) obtained
at steady state when there is only updating based on direct evidence. In
most cases the performance is broadly no better than, and indeed often worse
than, the results which combine evidential updating with belief combination

between agents. For low evidence rates where r < 0.1 the population does not
tend to fully converge to a steady state since there is insufficient evidence avail-
able to allow convergence. For higher evidence rates under Dempster’s rule,
Dubois & Prade’s operator and Yager’s rule, the population eventually converges
on a single state with complete certainty. However, since the average value of
Bel({s3}) in both cases is approximately 0.6 for r > 0.002, convergence is clearly
often not to the best state. The averaging operator is not affected by the
combined updating method and performs the same under evidential updating
alone as it does in conjunction with consensus formation.
Overall, it is clear then that in this formulation of the best-of-n problem
combining both updating from direct evidence and belief combination results in
much better performance than obtained by using evidential updating alone for
all considered operators except the averaging operator.

5.4 Noisy Evidence

Noise is ubiquitous in applications of multi-agent systems. In embodied agents


such as robots, this is often a result of sensor errors, but noise can also be a
feature of an inherently variable environment. In this section we consider the
effect of evidential noise on the best-of-n problem, as governed by the standard
deviation σ of the noise.
Figure 4 shows the average value of Bel({s3 }) at steady state plotted against
σ ∈ [0, 0.3] for different evidence rates r ∈ {0.01, 0.05, 0.1}. Figure 4 (left) shows
that for an evidence rate r = 0.01, all operators except the averaging operator
have very similar performance in the presence of noise. For example with no
noise, i.e., σ = 0, Yager’s rule converges to an average value of 0.97, Dubois
& Prade’s operator converges to an average of Bel({s3 }) = 0.96, Dempster’s
rule to 0.95 on average, and the averaging operator to 0.4. Then, with σ = 0.3,
Yager’s rule converges to an average value of 0.8, Dubois & Prade’s operator to
an average value of Bel({s3 }) = 0.77, Dempster’s rule to 0.74, and the averaging
operator converges to 0.29. Hence, all operators are affected by the noise to a
similar extent given this low evidence rate.

Fig. 4. Average Bel({s3 }) for all four operators plotted against σ ∈ [0, 0.3] for different
evidence rates r. Left: r = 0.01. Centre: r = 0.05. Right: r = 0.1.

In contrast, for the evidence rates of r = 0.05 and r = 0.1, Fig. 4 (centre) and
(right), respectively, we see that both Dubois & Prade’s operator and Yager’s
rule are the most robust combination operators to increased noise. Specifically,
for r = 0.05 and σ = 0, they both converge to an average value of Bel({s3 }) = 1
and for σ = 0.3 they only decrease to 0.99. On the other hand, the presence
of noise at this evidence rate has a much higher impact on the performance of
Dempster’s rule and the averaging operator. For σ = 0 Dempster’s rule converges
to an average value of Bel({s3 }) = 0.95 but this decreases to 0.78 for σ = 0.3, and
for the averaging operator the average value of Bel({s3 }) = 0.41 and decreases
to 0.29. The contrast between the performance of the operators in the presence
of noise is even greater for the evidence rate r = 0.1 as seen in Fig. 4 (right).
However, both Dubois & Prade’s operator and Yager’s rule differ in this context
since, for both evidence rates r = 0.05 and r = 0.1, their average values of
Bel({s3 }) remain constant at approximately 1.

5.5 Scalability to Larger Numbers of States

In the swarm robotics literature most best-of-n studies are for n = 2 (see for
example [17,23]). However, there is a growing interest in studying larger numbers
of choices in this context [3,18]. Indeed, for many distributed decision-making
applications the size of the state space, i.e., the value of n in the best-of-n
problem, will be much larger. Hence, it is important to investigate the scalability
of the proposed DST approach to larger values of n.
Having up to now focused on the n = 3 case, in this section we present
additional simulation results for n = 5 and n = 10. As proposed in Sect. 5.1,
the quality values are allocated so that $q_i = \frac{i}{n+1}$ for $i = 1, \ldots, n$. Here, we
only consider Dubois & Prade's operator and Yager's rule due to their better
performance when compared with the other two combination operators.

(a) Dubois & Prade’s operator. (b) Yager’s rule.

Fig. 5. Average Bel({sn }) for n ∈ {3, 5, 10} plotted against σ for r = 0.05.

Figure 5 shows the average values of Bel({sn }) at steady state plotted against
noise σ ∈ [0, 0.3] for evidence rate r = 0.05, where Bel({sn }) is the belief in the
best state for n = 3, 5 and 10. For Dubois & Prade’s operator, Fig. 5a shows

the steady state values of Bel({s3}) = 1 independent of the noise level, followed
closely by the values of Bel({s5}) = 0.94 at σ = 0 for the n = 5 case. However, for
n = 10 the value of Bel({s10}) is 0.61 when σ = 0, corresponding to a significant
decrease in performance. At the same time, from Fig. 5b, we can see that for
Yager's rule performance declines much less rapidly with increasing n than for
Dubois & Prade's operator. So at σ = 0 and n = 5 the average value at steady
state for Yager's rule is almost the same as for n = 3, i.e. Bel({s5}) = 0.98, with
only a slight decrease in performance to Bel({s10}) = 0.92 for n = 10. As expected
the performance of both operators decreases as σ increases, with Yager’s rule
being much more robust to noise than Dubois & Prade’s operator for large values
of n.
In this way, the results support only limited scalability for the DST approach
to the best-of-n problem, at least as far as uniquely identifying the best state is
concerned. Furthermore, as n increases so does sensitivity to noise. This reduced
performance may in part be a feature of the way quality values have been
allocated. Notice that as n increases, the difference between successive quality
values, $q_{i+1} - q_i = \frac{1}{n+1}$, decreases. This is likely to make it difficult for a
population of agents to distinguish between the best state and those which have
increasingly similar quality values. Furthermore, a given noise standard deviation σ
results in a less accurate ordering of the quality values the closer those values are
to each other, compounding this difficulty.

6 Conclusions and Future Work


In this paper we have introduced a model of consensus formation in the best-of-n
problem which combines updating from direct evidence with belief combination
between pairs of agents. We have utilised DST as a convenient framework for
representing agents’ beliefs, as well as the evidence that agents receive from
the environment. In particular, we have studied and compared the macro-level
convergence properties of several established operators applied iteratively in a
dynamic multi-agent setting and through simulation we have identified several
important properties of these operators within this context. Yager’s rule and
Dubois & Prade’s operator are shown to be most effective at reducing polari-
sation and reaching a consensus for all except very low evidence rates, despite
them not satisfying certain desirable properties, e.g., Dubois & Prade’s operator
is not associative while Yager’s rule is only quasi-associative [7]. Both have also
demonstrated robustness to different noise levels. However, Yager’s rule is more
robust to noise than Dubois & Prade’s operator for large values of states n > 3.
Although the performance of both operators decreases with an increase in the
number of states n, Yager’s rule is shown to be more scalable. We believe that
underlying the difference in the performance of all but the averaging operator is
the way in which they handle inconsistent beliefs: specifically, the manner in
which Dempster's rule, Dubois & Prade's operator and Yager's rule each reallocate
the mass associated with the inconsistent, non-overlapping sets.

Further work will investigate the issue of scalability in more detail, including
whether alternatives to the updating process may be applicable in a DST model,
such as that of negative updating in swarm robotics [14]. We must also consider
the increasing computational cost of DST as the size of the state space increases
and investigate other representations such as possibility theory [9] as a means
of avoiding exponential increases in the cost of storing and combining mass
functions. Finally, we hope to adapt our method to be applied to a network, as
opposed to a complete graph, so as to study the effects of limited or constrained
communications on convergence.

Acknowledgments. This work was funded and delivered in partnership between


Thales Group, University of Bristol and with the support of the UK Engineering and
Physical Sciences Research Council, ref. EP/R004757/1 entitled “Thales-Bristol Part-
nership in Hybrid Autonomous Systems Engineering (T-B PHASE)”.

References
1. Cho, J.H., Swami, A.: Dynamics of uncertain opinions in social networks. In: 2014
IEEE Military Communications Conference, pp. 1627–1632 (2014)
2. Crosscombe, M., Lawry, J.: A model of multi-agent consensus for vague and uncer-
tain beliefs. Adapt. Behav. 24(4), 249–260 (2016)
3. Crosscombe, M., Lawry, J., Hauert, S., Homer, M.: Robust distributed decision-
making in robot swarms: exploiting a third truth state. In: 2017 IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems (IROS), pp. 4326–4332.
IEEE (September 2017). https://doi.org/10.1109/IROS.2017.8206297
4. Dabarera, R., Núñez, R., Premaratne, K., Murthi, M.N.: Dynamics of belief theo-
retic agent opinions under bounded confidence. In: 17th International Conference
on Information Fusion (FUSION), pp. 1–8 (2014)
5. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
Ann. Math. Stat. 38(2), 325–339 (1967)
6. Douven, I., Kelp, C.: Truth approximation, social epistemology, and opinion
dynamics. Erkenntnis 75, 271–283 (2011)
7. Dubois, D., Liu, W., Ma, J., Prade, H.: The basic principles of uncertain infor-
mation fusion. An organised review of merging rules in different representation
frameworks. Inf. Fusion 32, 12–39 (2016). https://doi.org/10.1016/j.inffus.2016.02.006
8. Dubois, D., Prade, H.: Representation and combination of uncertainty with belief
functions and possibility measures. Comput. Intell. 4(3), 244–264 (1988). https://
doi.org/10.1111/j.1467-8640.1988.tb00279.x
9. Dubois, D., Prade, H.: Possibility theory, probability theory and multiple-valued
logics: a clarification. Ann. Math. Artif. Intell. 32(1), 35–66 (2001). https://doi.
org/10.1023/A:1016740830286
10. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence: models,
analysis and simulation. J. Artif. Soc. Soc. Simul. 5, 2 (2002)
11. Jøsang, A.: The consensus operator for combining beliefs. Artif. Intell. 141(1–2),
157–170 (2002). https://doi.org/10.1016/S0004-3702(02)00259-X
12. Kanjanatarakul, O., Denœux, T.: Distributed data fusion in the Dempster-Shafer
framework. In: 2017 12th System of Systems Engineering Conference (SoSE), pp.
1–6. IEEE (June 2017). https://doi.org/10.1109/SYSOSE.2017.7994954

13. Lee, C., Lawry, J., Winfield, A.: Combining opinion pooling and evidential updat-
ing for multi-agent consensus. In: Proceedings of the Twenty-Seventh International
Joint Conference on Artificial Intelligence (IJCAI-2018), pp. 347–353 (2018)
14. Lee, C., Lawry, J., Winfield, A.: Negative updating combined with opinion pooling
in the best-of-n problem in swarm robotics. In: Dorigo, M., Birattari, M., Blum,
C., Christensen, A.L., Reina, A., Trianni, V. (eds.) ANTS 2018. LNCS, vol. 11172,
pp. 97–108. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00533-7 8
15. Lu, X., Mo, H., Deng, Y.: An evidential opinion dynamics model based on het-
erogeneous social influential power. Chaos Solitons Fractals 73, 98–107 (2015).
https://doi.org/10.1016/j.chaos.2015.01.007
16. Parker, C.A.C., Zhang, H.: Cooperative decision-making in decentralized multiple-
robot systems: the best-of-n problem. IEEE/ASME Trans. Mechatron. 14(2), 240–
251 (2009). https://doi.org/10.1109/TMECH.2009.2014370
17. Reina, A., Bose, T., Trianni, V., Marshall, J.A.R.: Effects of spatiality on value-
sensitive decisions made by robot swarms. In: Groß, R., et al. (eds.) Distributed
Autonomous Robotic Systems. SPAR, vol. 6, pp. 461–473. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-73008-0 32
18. Reina, A., Marshall, J.A.R., Trianni, V., Bose, T.: Model of the best-of-n nest-site
selection process in honeybees. Phys. Rev. E 95, 052411 (2017). https://doi.org/
10.1103/PhysRevE.95.052411
19. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
20. Smets, P.: Analyzing the combination of conflicting belief functions. Inf. Fusion
8(4), 387–412 (2007). https://doi.org/10.1016/j.inffus.2006.04.003
21. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66, 387–412
(1994)
22. Valentini, G., Ferrante, E., Dorigo, M.: The best-of-n problem in robot swarms:
formalization, state of the art, and novel perspectives. Front. Robot. AI 4, 9 (2017).
https://doi.org/10.3389/frobt.2017.00009
23. Valentini, G., Hamann, H., Dorigo, M.: Self-organized collective decision making:
the weighted voter model. In: Proceedings of the 2014 International Conference
on Autonomous Agents and Multi-agent Systems, pp. 45–52. AAMAS 2014. Inter-
national Foundation for Autonomous Agents and Multiagent Systems, Richland
(2014)
24. Wickramarathne, T.L., Premaratne, K., Murthi, M.N., Chawla, N.V.: Conver-
gence analysis of iterated belief revision in complex fusion environments. IEEE
J. Sel. Top. Signal Process. 8(4), 598–612 (2014). https://doi.org/10.1109/JSTSP.
2014.2314854
25. Yager, R.R.: On the specificity of a possibility distribution. Fuzzy Sets Syst. 50(3),
279–292 (1992). https://doi.org/10.1016/0165-0114(92)90226-T
Order-Independent Structure Learning
of Multivariate Regression Chain Graphs

Mohammad Ali Javidian, Marco Valtorta(B) , and Pooyan Jamshidi

University of South Carolina, Columbia, USA


[email protected], {mgv,pjamshid}@cse.sc.edu

Abstract. This paper deals with multivariate regression chain graphs


(MVR CGs), which were introduced by Cox and Wermuth in the nineties
to represent linear causal models with correlated errors. We consider the
PC-like algorithm for structure learning of MVR CGs, a constraint-based
method proposed by Sonntag and Peña in 2012. We show that the PC-
like algorithm is order-dependent, because the output can depend on the
order in which the variables are given. This order-dependence is a minor
issue in low-dimensional settings. However, it can be very pronounced in
high-dimensional settings, where it can lead to highly variable results.
We propose two modifications of the PC-like algorithm that remove part
or all of this order-dependence. Simulations under a variety of settings
demonstrate the competitive performance of our algorithms in compari-
son with the original PC-like algorithm in low-dimensional settings and
improved performance in high-dimensional settings.

Keywords: Multivariate regression chain graph · Structural learning ·


Order independence · High-dimensional data · Scalable machine
learning techniques

1 Introduction
Chain graphs were introduced by Lauritzen, Wermuth and Frydenberg [5,9]
as a generalization of graphs based on undirected graphs and directed acyclic
graphs (DAGs). Later Andersson, Madigan and Perlman introduced an alter-
native Markov property for chain graphs [1]. In 1993 [3], Cox and Wermuth
introduced multivariate regression chain graphs (MVR CGs). The different inter-
pretations of CGs have different merits, but none of the interpretations subsumes
another interpretation [4].
Acyclic directed mixed graphs (ADMGs), also known as semi-Markov(ian)
[12] models contain directed (→) and bidirected (↔) edges subject to the restric-
tion that there are no directed cycles [15]. An ADMG that has no partially
directed cycle is called a multivariate regression chain graph. Cox and Wermuth
represented these graphs using directed edges and dashed edges, but we fol-
low Richardson [15] because bidirected edges allow the m-separation criterion
Supported by AFRL and DARPA (FA8750-16-2-0042).
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 324–338, 2019.
https://doi.org/10.1007/978-3-030-35514-2_24
Order-Independent Structure Learning of MVR CGS 325

(defined in Sect. 2) to be viewed more directly as an extension of d-separation


than is possible with dashed edges [15].
Unlike in the other CG interpretations, the bidirected edge in MVR CGs
has a strong intuitive meaning. It can be seen to represent one or more hidden
common causes between the variables connected by it. In other words, in an MVR
CG any bidirected edge X ↔ Y can be replaced by X ← H → Y to obtain a
Bayesian network representing the same independence model over the original
variables, i.e. excluding the new variables H. These variables are called hidden,
or latent, and have been marginalized away in the CG model. See [7,17,18] for
details on the properties of MVR chain graphs.
Two constraint-based learning algorithms, that use a statistical analysis to
test the presence of a conditional independency, exist for learning MVR CGs: (1)
the PC-like algorithm [16], and (2) the answer set programming (ASP) algorithm
[13]. The PC-like algorithm extends the original learning algorithm for Bayesian
networks by Peter Spirtes and Clark Glymour [19]. It learns the structure of
the underlying MVR chain graph in four steps: (a) determining the skeleton:
the resulting undirected graph in this phase contains an undirected edge u − v
iff there is no set S ⊆ V \ {u, v} such that u ⊥⊥ v|S; (b) determining the
v-structures (unshielded colliders); (c) orienting some of the undirected/directed
edges into directed/bidirected edges according to a set of rules applied iteratively;
(d) transforming the resulting graph in the previous step into an MVR CG.
The essential recovery algorithm obtained after step (c) contains all directed
and bidirected edges that are present in every MVR CG of the same Markov
equivalence class.
In this paper, we show that the PC-like algorithm is order-dependent, in the
sense that the output can depend on the order in which the variables are given.
We propose several modifications of the PC-like algorithm that remove part or all
of this order-dependence, but do not change the result when perfect conditional
independence information is used. When applied to data, the modified algorithms
are partly or fully order-independent. Proofs, implementations in R, and details
of experimental results can be found in the supplementary material at https://
github.com/majavid/SUM2019.

2 Definitions and Concepts


Below we briefly list some of the most important concepts used in this paper.
If there is an arrow from a pointing towards b, a is said to be a parent of b.
The set of parents of b is denoted as pa(b). If there is a bidirected edge between
a and b, a and b are said to be neighbors. The set of neighbors of a vertex a
is denoted as ne(a). The expressions pa(A) and ne(A) denote the collection of
parents and neighbors of vertices in A that are not themselves elements of A.
The boundary bd(A) of a subset A of vertices is the set of vertices in V \ A that
are parents or neighbors to vertices in A.
A path of length n from a to b is a sequence a = a0 , . . . , an = b of distinct
vertices such that (ai → ai+1 ) ∈ E for all i = 0, . . . , n − 1. A chain of length n from
326 M. A. Javidian et al.

a to b is a sequence a = a0 , . . . , an = b of distinct vertices such that (ai → ai+1 ) ∈
E, or (ai+1 → ai ) ∈ E, or (ai+1 ↔ ai ) ∈ E, for all i = 0, . . . , n − 1. We say that u is
an ancestor of v and v is a descendant of u if there is a path from u to v in G.
The set of ancestors of v is denoted as an(v), and we define An(v) = an(v) ∪ v.
We apply this definition to sets: an(X) = {α|α is an ancestor of β for some β ∈
X}. A partially directed cycle in a graph G is a sequence of n distinct vertices
v1 , . . . , vn (n ≥ 3),and vn+1 ≡ v1 , such that ∀i(1 ≤ i ≤ n) either vi ↔ vi+1 or
vi → vi+1 , and ∃j(1 ≤ j ≤ n) such that vi → vi+1 .
A graph with only undirected edges is called an undirected graph (UG). A
graph with only directed edges and without directed cycles is called a directed
acyclic graph (DAG). Acyclic directed mixed graphs, also known as semi-
Markov(ian) [12] models contain directed (→) and bidirected (↔) edges subject
to the restriction that there are no directed cycles [15]. A graph that has no
partially directed cycles is called a chain graph.
A nonendpoint vertex ζ on a chain is a collider on the chain if the edges
preceding and succeeding ζ on the chain have an arrowhead at ζ, that is, → ζ ←,
or ↔ ζ ↔, or ↔ ζ ←, or → ζ ↔. A nonendpoint vertex ζ on a chain which is
not a collider is a noncollider on the chain. A chain between vertices α and β in
chain graph G is said to be m-connecting given a set Z (possibly empty), with
α, β ∉ Z, if every noncollider on the path is not in Z, and every collider on the
path is in AnG (Z).
A chain that is not m-connecting given Z is said to be blocked given (or by)
Z. If there is no chain m-connecting α and β given Z, then α and β are said to
be m-separated given Z. Sets X and Y are m-separated given Z, if for every pair
α, β, with α ∈ X and β ∈ Y , α and β are m-separated given Z (X, Y , and Z are
disjoint sets; X, Y are nonempty). We denote the independence model resulting
from applying the m-separation criterion to G by $\mathcal{I}_m(G)$. This is an extension
of Pearl's d-separation criterion [11] to MVR chain graphs, in that in a DAG D
a chain is d-connecting if and only if it is m-connecting.
We say that two MVR CGs G and H are Markov equivalent, or that they are
in the same Markov equivalence class, iff $\mathcal{I}_m(G) = \mathcal{I}_m(H)$. If G and H have the
same adjacencies and unshielded colliders, then $\mathcal{I}_m(G) = \mathcal{I}_m(H)$ [21].
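The collider patterns above (→ ζ ←, ↔ ζ ↔, ↔ ζ ←, → ζ ↔) reduce to a single check on edge marks: both edges adjacent to ζ must carry an arrowhead at ζ. This can be sketched with a string encoding of edges, which is our own illustrative choice:

```python
# Edges are written from left to right along the chain: '->', '<-', or '<->'.
def has_head_at_right(edge):
    return edge in ('->', '<->')

def has_head_at_left(edge):
    return edge in ('<-', '<->')

def is_collider(preceding, succeeding):
    """ζ is a collider iff both adjacent edges carry an arrowhead at ζ,
    i.e. -> ζ <-, <-> ζ <->, <-> ζ <-, or -> ζ <->."""
    return has_head_at_right(preceding) and has_head_at_left(succeeding)

assert is_collider('->', '<-') and is_collider('<->', '<->')
assert is_collider('<->', '<-') and is_collider('->', '<->')
assert not is_collider('->', '->') and not is_collider('<-', '<-')
```

A full m-separation test additionally requires checking membership of noncolliders in Z and of colliders in AnG(Z), as in the definition above.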
Just like for many other probabilistic graphical models there might exist
multiple MVR CGs that represent the same independence model. Sometimes
it can however be desirable to have a unique graphical representation of the
different representable independence models in the MVR CGs interpretation. A
graph G∗ is said to be the essential MVR CG of an MVR CG G if it has the
same skeleton as G and contains all and only the arrowheads common to every
MVR CG in the Markov equivalence class of G. One thing that can be noted
here is that an essential MVR CG does not need to be a MVR CG. Instead these
graphs can contain three types of edges, undirected, directed and bidirected [17].

3 Order-Dependent PC-Like Algorithm


In this section, we show that the PC-like algorithm proposed by Sonntag and
Peña in [16] is order-dependent, in the sense that the output can depend on
Order-Independent Structure Learning of MVR CGS 327

the order in which the variables are given. The PC-like algorithm for learning
MVR CGs under the faithfulness assumption is formally described in
Algorithm 1.

Algorithm 1. The order-dependent PC-like algorithm for learning MVR
chain graphs [16]
Input: A set V of nodes and a probability distribution p faithful to an
unknown MVR CG G, and an ordering order(V ) on the variables.
Output: An MVR CG G′ s.t. G and G′ are Markov equivalent and G′ has
exactly the minimum set of bidirected edges for its equivalence class.
1 Let H denote the complete undirected graph over V ;
/* Skeleton Recovery */
2 for i ← 0 to |VH | − 2 do
3   while possible do
4     Select any ordered pair of nodes u and v in H such that u ∈ adH (v) and
      |adH (u) \ v| ≥ i, using order(V );
      /* adH (x) := {y ∈ V | x → y, y → x, or x ↔ y} */
5     if there exists S ⊆ (adH (u) \ v) s.t. |S| = i and u ⊥⊥p v|S (i.e., u is
      independent of v given S in the probability distribution p) then
6       Set Suv = Svu = S;
7       Remove the edge u − v from H;
8     end
9   end
10 end
/* v-structure Recovery */
11 for each m-separator Suv do
12   if u − w − v appears in the skeleton and w is not in Suv then
      /* u ∗→ w means u → w or u ↔ w; likewise, w ←∗ v means
      w ← v or w ↔ v. */
13     Determine a v-structure u ∗→ w ←∗ v;
14   end
15 end
16 Apply rules 1–3 in Figure 1 while possible;
/* After this line, the learned graph is the essential graph of MVR
CG G. */
17 Let Gu be the subgraph of G′ containing only the nodes and the undirected
edges in G′ ;
18 Let T be the junction tree of Gu ;
/* If Gu is disconnected, the cliques belonging to different
connected components can be linked with empty separators, as
described in [6, Theorem 4.8]. */
19 Order the cliques C1 , . . . , Cn of Gu s.t. C1 is the root of T and, if Ci is closer to
the root than Cj in T , then Ci < Cj ;
20 Order the nodes such that if A ∈ Ci , B ∈ Cj , and Ci < Cj , then A < B;
21 Orient the undirected edges in G′ according to the ordering obtained in line 20.
328 M. A. Javidian et al.

Fig. 1. The rules [16]

In applications, we do not have perfect conditional independence information.


Instead, we assume that we have an i.i.d. sample of size n of variables V =
(X1 , . . . , Xp). In the PC-like algorithm [16] all conditional independence queries
are estimated by statistical conditional independence tests at some pre-specified
significance level (p-value) α. For example, if the distribution of V is multivariate
Gaussian, one can test for zero partial correlation, see, e.g., [8]. For this purpose,
we use the gaussCItest() function from the R package pcalg throughout this
paper. Let order(V ) denote an ordering on the variables in V . We now consider
the role of order(V ) in every step of the algorithm.
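As an illustration of such a test, the following pure-Python sketch (ours, a simplified stand-in for gaussCItest(), limited to conditioning sets of size at most one via the standard recursion formula for partial correlations) implements the Fisher-z test for zero partial correlation:

```python
import math

def corr(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation r_{xy.z} via the recursion formula."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def gauss_ci(x, y, z=None, crit=2.576):
    """Fisher-z test of zero (partial) correlation; returns True when the
    independence hypothesis is NOT rejected. crit = 2.576 approximates
    the two-sided normal quantile for alpha = 0.01."""
    r = corr(x, y) if z is None else partial_corr(x, y, z)
    k = 0 if z is None else 1                  # size of the conditioning set
    zstat = math.sqrt(len(x) - k - 3) * 0.5 * math.log((1 + r) / (1 - r))
    return abs(zstat) <= crit
```

The pcalg implementation generalizes this to conditioning sets of arbitrary size by computing partial correlations from the inverse correlation matrix.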
In the skeleton recovery phase of the PC-like algorithm [16], the order of
variables affects the estimation of the skeleton and the separating sets. In
particular, as noted for the special case of Bayesian networks in [2], for each level
of i, the order of variables determines the order in which pairs of adjacent
vertices and subsets S of their adjacency sets are considered (see lines 4 and 5 in
Algorithm 1). The skeleton H is updated after each edge removal. Hence, the
adjacency sets typically change within one level of i, and this affects which other
conditional independencies are checked, since the algorithm only conditions on
subsets of the adjacency sets. When we have perfect conditional independence
information, all orderings on the variables lead to the same output. In the sam-
ple version, however, we typically make mistakes in keeping or removing edges,
because conditional independence relationships have to be estimated from data.
In such cases, the resulting changes in the adjacency sets can lead to different
skeletons, as illustrated in Example 1.
Moreover, different variable orderings can lead to different separating sets
in the skeleton recovery phase. When we have perfect conditional independence
information, this is not important, because any valid separating set leads to
the correct v -structure decision in the orientation phase. In the sample version,
however, different separating sets in the skeleton recovery phase of the algorithm
may yield different decisions about v -structures in the orientation phase. This is
illustrated in Example 2.
Finally, we consider the role of order(V ) on the orientation rules in the essen-
tial graph recovery phase of the sample version of the PC-like algorithm. Example
3 illustrates that different variable orderings can lead to different orientations,
even if the skeleton and separating sets are order-independent.
Example 1 (Order-dependent skeleton of the PC-like algorithm). Suppose that
the distribution of V = {a, b, c, d, e} is faithful to the DAG in Fig. 2(a). This DAG
encodes the following conditional independencies (using the notation defined in
line 5 of Algorithm 1) with minimal separating sets: a ⊥⊥ d|{b, c} and
a ⊥⊥ e|{b, c}.
Suppose that we have an i.i.d. sample of (a, b, c, d, e), and that the following
conditional independencies with minimal separating sets are judged to hold at
some significance level α: a ⊥⊥ d|{b, c}, a ⊥⊥ e|{b, c, d}, and c ⊥⊥ e|{a, b, d}. Thus,
the first two are correct, while the third is false.
We now apply the skeleton recovery phase of the PC-like algorithm with two
different orderings: order1 (V ) = (d, e, a, c, b) and order2 (V ) = (d, c, e, a, b). The
resulting skeletons are shown in Figs. 2(b) and (c), respectively.

a a a

b c b c b c

d d d

e e e

(a) (b) (c)

Fig. 2. (a) The DAG G, (b) the skeleton returned by Algorithm 1 with order1 (V ), (c)
the skeleton returned by Algorithm 1 with order2 (V ).

We see that the skeletons are different, and that both are incorrect, as the
edge c − e is missing. The skeleton for order2 (V ) contains an additional error,
as there is an additional edge a − e. We now go through Algorithm 1 to see
what happened. We start with a complete undirected graph on V . When i = 0,
variables are tested for marginal independence, and the algorithm correctly does
not remove any edge. Also, when i = 1, the algorithm correctly does not remove
any edge. When i = 2, there is a pair of vertices that is thought to be condition-
ally independent given a subset of size two, and the algorithm correctly removes
the edge between a and d. When i = 3, there are two pairs of vertices that are
thought to be conditionally independent given a subset of size three. Table 1
shows the trace table of Algorithm 1 for i = 3 and order1 (V ) = (d, e, a, c, b).

Table 1. The trace table of Algorithm 1 for i = 3 and order1 (V ) = (d, e, a, c, b).

Ordered pair (u, v) adH (u) Suv Is Suv ⊆ adH (u) \ {v}? Is u − v removed?
(e, a) {a, b, c, d} {b, c, d} Yes Yes
(e, c) {b, c, d} {a, b, d} No No
(c, e) {a, b, d, e} {a, b, d} Yes Yes

Table 2 shows the trace table of Algorithm 1 for i = 3 and order2 (V ) =


(d, c, e, a, b).

Table 2. The trace table of Algorithm 1 for i = 3 and order2 (V ) = (d, c, e, a, b).

Ordered pair (u, v) adH (u) Suv Is Suv ⊆ adH (u) \ {v}? Is u − v removed?
(c, e) {a, b, d, e} {a, b, d} Yes Yes
(e, a) {a, b, d} {b, c, d} No No
(a, e) {b, c, e} {b, c, d} No No

Example 2 (Order-dependent separating sets and v-structures of the PC-like


algorithm). Suppose that the distribution of V = {a, b, c, d, e} is faithful to the
DAG in Fig. 3(a). This DAG encodes the following conditional independencies
with minimal separating sets: a ⊥⊥ d|b, a ⊥⊥ e|{b, c}, a ⊥⊥ e|{c, d}, b ⊥⊥ c, b ⊥⊥ e|d,
and c ⊥⊥ d.
Suppose that we have an i.i.d. sample of (a, b, c, d, e). Assume that all true
conditional independencies are judged to hold except c ⊥⊥ d. Suppose that c ⊥⊥
d|b and c ⊥⊥ d|e are thought to hold. Thus, the first is correct, while the second is
false. We now apply the v -structure recovery phase of the PC-like algorithm with
two different orderings: order1 (V ) = (d, c, b, a, e) and order3 (V ) = (c, d, e, a, b).
The resulting CGs are shown in Figs. 3(b) and (c), respectively. Note that while
the separating set for vertices c and d with order1 (V ) is Sdc = Scd = {b}, the
separating set for them with order3 (V ) is Scd = Sdc = {e}.
This illustrates that order-dependent separating sets in the skeleton recovery
phase of the sample version of the PC-algorithm can lead to order-dependent
v -structures.


Fig. 3. (a) The DAG G, (b) the CG returned after the v -structure recovery phase of
Algorithm 1 with order1 (V ), (c) the CG returned after the v -structure recovery phase
of Algorithm 1 with order3 (V ).

Example 3 (Order-dependent orientation rules of the PC-like algorithm). Con-


sider the graph in Fig. 4, and assume that this is the output of the sample
version of the PC-like algorithm after v -structure recovery. Also, consider that
c ∈ Sa,d and d ∈ Sb,f . Thus, we have two v-structures, namely a → c ← e
and b → d ← f , and four unshielded triples, namely (e, c, d), (c, d, f ), (a, c, d),
and (b, d, c). We then apply the orientation rules in the essential graph recovery
phase of the algorithm, starting with rule R1. If one of the two unshielded triples
(e, c, d) or (a, c, d) is considered first, we obtain c → d. On the other hand, if
one of the unshielded triples (b, d, c) or (c, d, f ) is considered first, then we obtain
c ← d. Note that we have no issues with overwriting of edges here, since as soon

as the edge c − d is oriented, all edges are oriented and no further orientation
rules are applied. These examples illustrate that the essential graph recovery
phase of the PC-like algorithm can be order-dependent regardless of the output
of the previous steps.

[Graph: a → c ← e,  c − d,  b → d ← f ]

Fig. 4. Possible mixed graph after v -structure recovery phase of the sample version of
the PC-like algorithm.

4 Order-Independent Algorithms for Learning MVR CGs


We now propose several modifications of the original PC-like algorithm (and
hence also of the related algorithms) that remove the order-dependence in the
various stages of the algorithm, analogously to what Colombo and Maathuis [2]
did for the original PC algorithm in the case of DAGs. For this purpose, we
discuss the skeleton, v -structures, and the orientation rules, respectively.

4.1 Order-Independent Skeleton Recovery


We first consider estimation of the skeleton in the adjacency search of the PC-
like algorithm. The pseudocode for our modification is given in Algorithm 2. The
resulting PC-like algorithm in Algorithm 2 is called stable PC-like.
The main difference between Algorithms 1 and 2 is given by the for-loop on
lines 3–5 in the latter one, which computes and stores the adjacency sets aH (vi ) of
all variables after each new size i of the conditioning sets. These stored adjacency
sets aH (vi ) are used whenever we search for conditioning sets of this given size
i. Consequently, an edge deletion on line 10 no longer affects which conditional
independencies are checked for other pairs of variables at this level of i.
In other words, at each level of i, Algorithm 2 records which edges should
be removed, but for the purpose of the adjacency sets it removes these edges
only when it goes to the next value of i. Besides resolving the order-dependence
in the estimation of the skeleton, our algorithm has the advantage that it is
easily parallelizable at each level of i. The stable PC-like algorithm is correct,
i.e. it returns an MVR CG to which the given probability distribution is faithful
(Theorem 1), and it yields order-independent skeletons in the sample version
(Theorem 2). We illustrate the algorithm in Example 4.
Theorem 1. Let the distribution of V be faithful to an MVR CG G, and assume
that we are given perfect conditional independence information about all pairs of
variables (u, v) in V given subsets S ⊆ V \ {u, v}. Then the output of the stable
PC-like algorithm is an MVR CG that has exactly the minimum set of bidirected
edges for its equivalence class.

Theorem 2. The skeleton resulting from the sample version of the stable PC-
like algorithm is order-independent.

Example 4 (Order-independent skeletons). We go back to Example 1, and con-


sider the sample version of Algorithm 2. The algorithm now outputs the skele-
ton shown in Fig. 2(b) for both orderings order1 (V ) and order2 (V ). We again
go through the algorithm step by step. We start with a complete undirected
graph on V . No conditional independencies are found when i = 0. Also, when i = 1,
the algorithm correctly does not remove any edge. When i = 2, the algorithm
first computes the new adjacency sets: aH (v) = V \ {v}, ∀v ∈ V . There is a
pair of variables that is thought to be conditionally independent given a sub-
set of size two, namely (a, d). Since the sets aH (v) are not updated after edge
removals, it does not matter in which order we consider the ordered pair. Any
ordering leads to the removal of edge between a and d. When i = 3, the algo-
rithm first computes the new adjacency sets: aH (a) = aH (d) = {b, c, e} and
aH (v) = V \{v}, for v = b, c, e. There are two pairs of variables that are thought
to be conditionally independent given a subset of size three, namely (a, e) and
(c, e). Since the sets aH (v) are not updated after edge removals, it does not
matter in which order we consider the ordered pairs. Any ordering leads to the
removal of both edges a − e and c − e.

Algorithm 2. The order-independent (stable) PC-like algorithm for learning
MVR chain graphs.
Input: A set V of nodes and a probability distribution p faithful to an
unknown MVR CG G, and an ordering order(V ) on the variables.
Output: An MVR CG G′ s.t. G and G′ are Markov equivalent and G′ has
exactly the minimum set of bidirected edges for its equivalence class.
1 Let H denote the complete undirected graph over V = {v1 , . . . , vn };
/* Skeleton Recovery */
2 for i ← 0 to |VH | − 2 do
3   for j ← 1 to |VH | do
4     Set aH (vj ) = adH (vj );
5   end
6   while possible do
7     Select any ordered pair of nodes u and v in H such that u ∈ aH (v) and
      |aH (u) \ v| ≥ i, using order(V );
8     if there exists S ⊆ (aH (u) \ v) s.t. |S| = i and u ⊥⊥p v|S (i.e., u is
      independent of v given S in the probability distribution p) then
9       Set Suv = Svu = S;
10      Remove the edge u − v from H;
11    end
12  end
13 end
/* v-structure Recovery and orientation rules */
14 Follow the same procedures as in Algorithm 1 (lines 11–21).
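The contrast between the two skeleton-recovery phases can be made concrete with a compact Python sketch (ours, not the authors' implementation): the oracle `indep` in the test below hard-codes the independence judgements of Example 1, and the "while possible" loop is approximated by one fixed pass over the ordered pairs at each level i.

```python
from itertools import combinations

def skeleton(V, order, indep, stable=False):
    """Skeleton-recovery phase: with stable=True the adjacency sets used to
    build conditioning sets are frozen at the start of each level i."""
    adj = {v: set(V) - {v} for v in V}                 # complete graph
    pairs = [(u, v) for u in order for v in order if u != v]
    for i in range(len(V) - 1):                        # i = 0, ..., |V| - 2
        a = {v: set(adj[v]) for v in V} if stable else adj
        for u, v in pairs:
            if v in adj[u] and len(a[u] - {v}) >= i:
                for S in combinations(sorted(a[u] - {v}), i):
                    if indep(u, v, set(S)):
                        adj[u].discard(v)              # remove edge u - v
                        adj[v].discard(u)
                        break
    return {frozenset((u, v)) for u in V for v in adj[u]}
```

With stable=False, order1 (V ) and order2 (V ) from Example 1 give different skeletons (the second keeps the extra edge a − e); with stable=True, both orderings give the skeleton of Fig. 2(b), as in Example 4.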

4.2 Order-Independent v -structures Recovery

We propose two methods to resolve the order-dependence in the determination


of the v -structures, using the conservative PC algorithm (CPC) of Ramsey et al.
[14] and the majority rule PC-like algorithm (MPC) of Colombo & Maathuis [2].
The Conservative PC-like algorithm (CPC-like algorithm) works as
follows. Let H be the undirected graph resulting from the skeleton recovery phase
of the PC-like algorithm (Algorithm 1). For all unshielded triples (Xi , Xj , Xk )
in H, determine all subsets S of adH (Xi ) and of adH (Xk ) that make Xi and
Xk conditionally independent, i.e., that satisfy Xi ⊥⊥p Xk |S. We refer to such
sets as separating sets. The triple (Xi , Xj , Xk ) is labelled as unambiguous if at
least one such separating set is found and either Xj is in all separating sets or in
none of them; otherwise it is labelled as ambiguous. If the triple is unambiguous,
it is oriented as v -structure if and only if Xj is in none of the separating sets.
Moreover, in the v -structure recovery phase of the PC-like algorithm (Algorithm
1, lines 11–15), the orientation rules are adapted so that only unambiguous triples
are oriented. The output of the CPC-like algorithm is a mixed graph in which
ambiguous triples are marked. We refer to the combination of the stable PC-like
and CPC-like algorithms as the stable CPC-like algorithm.
In the case of DAGs, Colombo and Maathuis [2] found that the CPC-
algorithm can be very conservative, in the sense that very few unshielded triples
are unambiguous in the sample version, where conditional independence rela-
tionships have to be estimated from data. They proposed a minor modification
of the CPC approach, called Majority rule PC algorithm (MPC) to mitigate the
(unnecessary) severity of CPC-like approach. We similarly propose the Major-
ity rule PC-like algorithm (MPC-like) for MVR CGs. As in the CPC-like
algorithm, we first determine all subsets S of adH (Xi ) and of adH (Xk ) that make
Xi and Xk conditionally independent, i.e., that satisfy Xi ⊥⊥p Xk |S. The triple
(Xi , Xj , Xk ) is labelled as (α, β)-unambiguous if at least one such separating set
is found and Xj is in no more than α% or no less than β% of the separating sets,
for 0 ≤ α ≤ β ≤ 100. Otherwise it is labelled as ambiguous. (As an example,
consider α = 30 and β = 60.) If a triple is unambiguous, it is oriented as a
v -structure if and only if Xj is in less than α% of the separating sets. As in
the CPC-like algorithm, the orientation rules in the v -structure recovery phase
of the PC-like algorithm (Algorithm 1, lines 11–15) are adapted so that only
unambiguous triples are oriented, and the output is a mixed graph in which
ambiguous triples are marked. Note that the CPC-like algorithm is the special
case of the MPC-like algorithm with α = 0 and β = 100. We refer to the com-
bination of the stable PC-like and MPC-like algorithms as the stable MPC-like
algorithm.
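Both labelling rules reduce to a threshold test on the fraction of separating sets containing Xj. A minimal sketch (ours; percentages written as fractions, boundary cases handled as described above):

```python
def classify_triple(sepsets, xj, alpha=0.0, beta=1.0):
    """Label an unshielded triple (xi, xj, xk) from the separating sets found
    for (xi, xk). alpha = 0, beta = 1 gives the CPC-like rule ("in none" /
    "in all"); the MPC-like rule uses, e.g., alpha = beta = 0.5."""
    if not sepsets:
        return ('ambiguous', None)
    frac = sum(xj in s for s in sepsets) / len(sepsets)
    if frac <= alpha:                  # xj in no more than alpha% of sepsets
        return ('unambiguous', 'v-structure')
    if frac >= beta:                   # xj in no less than beta% of sepsets
        return ('unambiguous', 'non-collider')
    return ('ambiguous', None)
```

With the separating sets {b}, {e}, {b, e} of Example 5 and Xj = e, the CPC-like thresholds (α, β) = (0, 1) return "ambiguous", while α = β = 0.5 returns an unambiguous non-collider.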

Theorem 3. Let the distribution of V be faithful to an MVR CG G, and assume


that we are given perfect conditional independence information about all pairs of
variables (u, v) in V given subsets S ⊆ V \ {u, v}. Then the output of the (stable)
CPC/MPC-like algorithm is an MVR CG that is Markov equivalent to G and
has exactly the minimum set of bidirected edges for its equivalence class.

Theorem 4. The decisions about v-structures in the sample version of the stable
CPC/MPC-like algorithm are order-independent.
Example 5 (Order-independent decisions about v-structures). We consider the
sample versions of the stable CPC/MPC-like algorithm, using the same input
as in Example 2. In particular, we assume that all conditional independencies
induced by the MVR CG in Fig. 3(a) are judged to hold except c ⊥⊥ d. Suppose
that c ⊥⊥ d|b and c ⊥⊥ d|e are thought to hold. Let α = β = 50.
Denote the skeleton after the skeleton recovery phase by H. We consider
the unshielded triple (c, e, d). First, we compute aH (c) = {a, d, e} and aH (d) =
{a, b, c, e}, when i = 1. We now consider all subsets S of these adjacency sets,
and check whether c ⊥⊥ d|S. The following separating sets are found: {b}, {e},
and {b, e}. Since e is in some but not all of these separating sets, the stable CPC-
like algorithm determines that the triple is ambiguous, and no orientations are
performed. Since e is in more than half of the separating sets, stable MPC-like
determines that the triple is unambiguous and not a v -structure. The output of
both algorithms is given in Fig. 3(c).
At this point it should be clear why the modified PC-like algorithm is labeled
“conservative”: it is more cautious than the (stable) PC-like algorithm in drawing
unambiguous conclusions about orientations. As we showed in Example 5, the
output of the (stable) CPC-like algorithm may not be collider equivalent with
the true MVR CG G, if the resulting CG contains an ambiguous triple.

4.3 Order-Independent Orientation Rules


Even when the skeleton and the determination of the v -structures are order-
independent, Example 3 showed that there might be some order-dependent steps
left in the sample version. Regarding the orientation rules, we note that the PC-
like algorithm does not suffer from conflicting v -structures (as shown in [2] for
the PC-algorithm in the case of DAGs), because bi-directed edges are allowed.
However, the three orientation rules still suffer from order-dependence issues
(see Example 3 and Fig. 4). To solve this problem, we can use lists of candidate
edges for each orientation rule as follows: we first generate a list of all edges
that can be oriented by rule R1. We orient all these edges, creating bi-directed
edges if there are conflicts. We do the same for rules R2 and R3, and iterate this
procedure until no more edges can be oriented.
When using this procedure, we add the letter L (standing for lists), e.g.,
(stable) LCPC-like and (stable) LMPC-like. The (stable) LCPC-like and (stable)
LMPC-like algorithms are fully order-independent in the sample versions. The
procedure is illustrated in Example 6.
Theorem 5. Let the distribution of V be faithful to an MVR CG G, and assume
that we are given perfect conditional independence information about all pairs of
variables (u, v) in V given subsets S ⊆ V \ {u, v}. Then the output of the (stable)
LCPC/LMPC-like algorithm is an MVR CG that is Markov equivalent with G
that has exactly the minimum set of bidirected edges for its equivalence class.

Theorem 6. The sample versions of the stable LCPC-like and stable LMPC-like
algorithms are fully order-independent.

Example 6. Consider the structure shown in Fig. 4. As a first step, we construct


a list containing all candidate structures eligible for orientation rule R1 in the
phase of the essential graph recovery. The list contains the unshielded triples
(e, c, d), (c, d, f ), (a, c, d), and (b, d, c). Now, we go through each element in the
list and we orient the edges accordingly, allowing bi-directed edges. This yields
the edge orientation c ↔ d, regardless of the ordering of the variables.
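The list-based mechanism of Example 6 can be sketched as follows (our code; R1 is assumed to be the Meek-style rule "if x ∗→ y − z with x, z non-adjacent, orient y → z", which matches the triples used in the example):

```python
def r1_candidates(und, dirs, bi, skel):
    """All orientations y -> z licensed by the assumed rule R1: some x with
    an arrowhead at y (x -> y or x <-> y), y - z undirected, and x, z
    non-adjacent in the skeleton."""
    into = {}                         # y -> {x : x has an arrowhead at y}
    for t, h in dirs:
        into.setdefault(h, set()).add(t)
    for e in bi:
        u, v = tuple(e)
        into.setdefault(u, set()).add(v)
        into.setdefault(v, set()).add(u)
    cands = []
    for e in und:
        u, v = tuple(e)
        for y, z in ((u, v), (v, u)):
            if any(frozenset((x, z)) not in skel for x in into.get(y, ())):
                cands.append((y, z))
    return cands

def apply_all(und, dirs, bi, cands):
    """Apply every candidate orientation at once; conflicts become bidirected."""
    for y, z in cands:
        e = frozenset((y, z))
        if e in und:
            und.discard(e)
            dirs.add((y, z))
        elif (z, y) in dirs:          # already oriented the other way: conflict
            dirs.discard((z, y))
            bi.add(e)
```

For the graph of Fig. 4 this produces c ↔ d whatever the order in which the candidates are listed; iterating r1_candidates/apply_all, together with analogous lists for R2 and R3, until no rule applies gives the L-variants.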

5 Evaluation
In this section, we compare the performance of our algorithms (Table 3) with
the original PC-like learning algorithm by running them on randomly generated
MVR chain graphs in both low-dimensional and high-dimensional settings.
We report on the Gaussian case only because of space limitations.
We evaluate the performance of the proposed algorithms in terms of the six
measurements that are commonly used [2,8,10,20] for constraint-based learning
algorithms: (a) the true positive rate (TPR) (also known as sensitivity, recall, and
hit rate), (b) the false positive rate (FPR) (also known as fall-out), (c) the true
discovery rate (TDR) (also known as precision or positive predictive value), (d)
accuracy (ACC) for the skeleton, (e) the structural Hamming distance (SHD)
(this is the metric described in [20] to compare the structure of the learned
and the original graphs), and (f) run-time for the LCG recovery algorithms. In
principle, large values of TPR, TDR, and ACC, and small values of FPR and
SHD indicate good performance. All of these six measurements are computed on
the essential graphs of the CGs, rather than the CGs directly, to avoid spurious
differences due to random orientation of undirected edges.
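Assuming skeletons are represented as sets of vertex pairs, the first five measurements reduce to simple set arithmetic (our sketch; the SHD here is a skeleton-level simplification, whereas the SHD of [20] also penalizes wrong orientations):

```python
from itertools import combinations

def skeleton_metrics(true_edges, learned, vertices):
    """TPR, FPR, TDR and ACC of a learned skeleton, plus a skeleton-level
    SHD that counts missing and extra edges only."""
    pairs = {frozenset(p) for p in combinations(vertices, 2)}
    tp = len(true_edges & learned)        # correctly recovered edges
    fp = len(learned - true_edges)        # extra edges
    fn = len(true_edges - learned)        # missing edges
    tn = len(pairs) - tp - fp - fn        # correctly absent edges
    return {'TPR': tp / len(true_edges),
            'FPR': fp / (len(pairs) - len(true_edges)),
            'TDR': tp / len(learned),
            'ACC': (tp + tn) / len(pairs),
            'SHD': fp + fn}
```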

Table 3. Order-dependence issues and corresponding modifications of the PC-like


algorithm that remove the problem. “Yes” indicates that the corresponding aspect of
the graph is estimated order-independently in the sample version.

Skeleton v -structures decisions Edges orientations


PC-like No No No
Stable PC-like Yes No No
Stable CPC/MPC-like Yes Yes No
Stable LCPC/LMPC-like Yes Yes Yes

Figure 5 shows that: (a) as we expected [8,10], all algorithms work well on
sparse graphs (N = 2), (b) for all algorithms, typically the TPR, TDR, and
ACC increase with sample size, (c) for all algorithms, typically the SHD and FPR
decrease with sample size, (d) a large significance level (α = 0.05) typically yields

Fig. 5. The first two rows show the performance of the original (OPC) and stable
PC-like (SPC) algorithms for randomly generated Gaussian chain graph models:
averages over 30 repetitions with 50 variables, N = 2, and significance level α =
0.001. The last two rows show the performance of the original (OPC), stable (SPC),
and modified (stable CPC/MPC/LCPC/LMPC) PC-like algorithms for randomly
generated Gaussian chain graph models: averages over 30 repetitions with 1000
variables, N = 2, sample size S = 50, and significance levels α = 0.05, 0.01, 0.005, 0.001.

large TPR, FPR, and SHD, (e) while the stable PC-like algorithm has a better
TDR and FPR in comparison with the original PC-like algorithm, the original
PC-like algorithm has a better TPR (as observed in the case of DAGs [2]). This
can be explained by the fact that the stable PC-like algorithm tends to perform
more tests than the original PC-like algorithm, and (f) while the original PC-like
algorithm has a (slightly) better SHD in comparison with the stable PC-like
algorithm in low-dimensional data, the stable PC-like algorithm has a better
SHD in high-dimensional data. Also, (very) small variances indicate that the
order-independent versions of the PC-like algorithm in high-dimensional data are
stable. When considering average running times versus sample sizes, as shown
in Fig. 5, we observe that: (a) the average run time increases when sample size
increases; (b) generally, the average run time for the original PC-like algorithm
is (slightly) better than that for the stable PC-like algorithm in both low and
high dimensional settings.
In summary, empirical simulations show that our algorithms achieve com-
petitive results with the original PC-like learning algorithm; in particular, in the
Gaussian case the order-independent algorithms achieve output of better qual-
ity than the original PC-like algorithm, especially in high-dimensional settings.
Since we know of no score-based learning algorithms for MVR chain graphs (and,
in fact, for any kind of chain graphs), we plan to investigate the feasibility of a
scalable algorithm of this kind.

References
1. Andersson, S.A., Madigan, D., Perlman, M.D.: An alternative Markov property
for chain graphs. In: Proceedings of UAI Conference, pp. 40–48 (1996)
2. Colombo, D., Maathuis, M.H.: Order-independent constraint-based causal struc-
ture learning. J. Mach. Learn. Res. 15(1), 3741–3782 (2014)
3. Cox, D.R., Wermuth, N.: Linear dependencies represented by chain graphs. Stat.
Sci. 8(3), 204–218 (1993)
4. Drton, M.: Discrete chain graph models. Bernoulli 15(3), 736–753 (2009)
5. Frydenberg, M.: The chain graph Markov property. Scand. J. Stat. 17(4), 333–353
(1990)
6. Golumbic, M.: Algorithmic Graph Theory and Perfect Graphs. Academic Press,
New York (1980)
7. Javidian, M.A., Valtorta, M.: On the properties of MVR chain graphs. In: Work-
shop Proceedings of PGM Conference, pp. 13–24 (2018)
8. Kalisch, M., Bühlmann, P.: Estimating high-dimensional directed acyclic graphs
with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636 (2007)
9. Lauritzen, S., Wermuth, N.: Graphical models for associations between variables,
some of which are qualitative and some quantitative. Ann. Stat. 17(1), 31–57
(1989)
10. Ma, Z., Xie, X., Geng, Z.: Structural learning of chain graphs via decomposition.
J. Mach. Learn. Res. 9, 2847–2880 (2008)
11. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1988)
12. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University
Press, Cambridge (2009)

13. Peña, J.M.: Alternative Markov and causal properties for acyclic directed mixed
graphs. In: Proceedings of UAI Conference, pp. 577–586 (2016)
14. Ramsey, J., Spirtes, P., Zhang, J.: Adjacency-faithfulness and conservative causal
inference. In: Proceedings of UAI Conference, pp. 401–408 (2006)
15. Richardson, T.S.: Markov properties for acyclic directed mixed graphs. Scand. J.
Stat. 30(1), 145–157 (2003)
16. Sonntag, D., Peña, J.M.: Learning multivariate regression chain graphs under faith-
fulness. In: Proceedings of PGM Workshop, pp. 299–306 (2012)
17. Sonntag, D., Peña, J.M.: Chain graph interpretations and their relations revisited.
Int. J. Approx. Reason. 58, 39–56 (2015)
18. Sonntag, D., Peña, J.M., Gómez-Olmedo, M.: Approximate counting of graphical
models via MCMC revisited. Int. J. Intell. Syst. 30(3), 384–420 (2015)
19. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction and Search, 2nd edn.
MIT Press, Cambridge (2000)
20. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian
network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
21. Wermuth, N., Sadeghi, K.: Sequences of regressions and their independences. Test
21, 215–252 (2012)
Comparison of Analogy-Based Methods
for Predicting Preferences

Myriam Bounhas1,2(B) , Marc Pirlot3 , Henri Prade4 , and Olivier Sobrie3


1 Emirates College of Technology, Abu Dhabi, UAE
2 LARODEC Lab, ISG de Tunis, Tunis, Tunisia
myriam [email protected]
3 Faculté Polytechnique, Université de Mons, Mons, Belgium
[email protected], [email protected]
4 IRIT, Université Paul Sabatier, 118 route de Narbonne,
31062 Toulouse cedex 09, France
[email protected]

Abstract. Given a set of preferences between items taken by pairs and
described in terms of nominal or numerical attribute values, the problem
considered is to predict the preference between the items of a new pair.
The paper proposes and compares two approaches based on analogical
proportions, which are statements of the form “a is to b as c is to d”.
The first one uses triples of pairs of items for which preferences are
known and which make analogical proportions together with the new
pair. These proportions express, attribute by attribute, that the change
of values between the items of the first two pairs is the same as between
the last two pairs. This provides a basis for predicting the preference
associated with the fourth pair, also making sure that no contradictory
trade-offs are created. Moreover, we also consider the option that one
of the pairs in the triples is taken as a k-nearest neighbor of the new
pair. The second approach exploits pairs of compared items one by one:
for predicting the preference between two items, one looks for another
pair of items for which the preference is known such that, attribute by
attribute, the change between the elements of the first pair is the same
as between the elements of the second pair. As discussed in the paper,
the two approaches agree with the postulates underlying weighted aver-
ages and more general multiple-criteria aggregation models. The paper
proposes new algorithms for implementing these methods. The reported
experiments, both on real and on generated datasets, suggest the
effectiveness of the approaches. We also compare with predictions given
by weighted sums compatible with the data, obtained by linear
programming.

1 Introduction
Predicting preferences has become a challenging topic in artificial intelligence,
e.g., [9]. The idea of applying analogical proportion-based inference to this prob-
lem has been recently proposed [11] and different approaches have been suc-
cessfully tested [1,7], following previous studies that obtained good results in
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 339–354, 2019.
https://doi.org/10.1007/978-3-030-35514-2_25
340 M. Bounhas et al.

classification [3,10]. Analogical proportions are statements of the form “a is to


b as c is to d”, and express that the change (if any) between items a and b is
the same as the one between c and d. Analogical inference relies on the idea
that when analogical proportions hold, a related one may hold as well. Interestingly
enough, analogical inference may work with rather small amounts of
examples. There are two ways of making an analogical reading of preferences
between items (taken by pairs), which lead to different prediction methods. In
this paper, we explain the two analogical readings in Sect. 2 and how they give
rise to new prediction algorithms in Sect. 3. They are compared in Sect. 4 on
benchmarks with the two existing implementations, and with a linear program-
ming method when applicable.

2 Analogy and Linear Utility

Analogical proportions are statements of the form “a is to b as c is to d”, often


denoted a : b :: c : d. Like numerical proportions, they are quaternary relations that
are supposed to satisfy the following postulates: (i) a : b :: a : b holds (reflexivity);
(ii) if a : b :: c : d holds, c : d :: a : b holds (symmetry); (iii) if a : b :: c : d holds,
a : c :: b : d holds (central permutation). When a : b :: c : d holds, it expresses
that “a differs from b as c differs from d and b differs from a as d differs from
c”. This translates into a Boolean logical expression (see, e.g., [12–14]),

a : b :: c : d = ((a ∧ ¬b) ≡ (c ∧ ¬d)) ∧ ((b ∧ ¬a) ≡ (d ∧ ¬c)).


a : b :: c : d is true only for the 6 following patterns (0, 0, 0, 0), (1, 1, 1, 1),
(1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 0, 0), and (0, 0, 1, 1) for (a, b, c, d). This can be gener-
alized to nominal values; then a : b :: c : d holds true if and only if abcd is one of
the following patterns ssss, stst, or sstt where s and t are two possible distinct
values of items a, b, c and d.
Analogical proportions extend to vectors describing items in terms of
attribute values, such as a = (a1 , ..., an ), by stating a : b :: c : d iff ∀i ∈ [[1, n]], ai :
bi :: ci : di .
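As an illustration, the nominal definition above can be checked mechanically. The following sketch (function names are ours, not from the paper) tests a : b :: c : d component-wise for vectors of nominal values:

```python
def analogy_holds(a, b, c, d):
    """Check whether a : b :: c : d holds for vectors of nominal values.

    Component-wise, the proportion holds iff (ai, bi, ci, di) matches one
    of the patterns ssss, stst or sstt (s and t being two distinct values)."""
    def component(ai, bi, ci, di):
        # ssss / sstt: no change within each pair (ai == bi and ci == di);
        # stst (and ssss again): a pairs with c and b pairs with d.
        return (ai == bi and ci == di) or (ai == ci and bi == di)
    return all(component(*q) for q in zip(a, b, c, d))
```

Note that an invalid arrangement such as s t t s fails: neither of the two disjuncts holds for it.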

The basic analogical inference pattern applied to compared items is then, ∀i ∈
[[1, n]],

a1i : b1i :: c1i : d1i and a2i : b2i :: c2i : d2i
a1 ≽ a2
b1 ≽ b2
c1 ≽ c2
−−−−−−−−−−−−−−−
d1 ≽ d2 .
where x ≼ y expresses that y is preferred to x (equivalently y ≽ x); other
patterns (equivalent up to some rewriting) exist where, e.g., ≽ is changed into
≼ for (i) pairs (b1 , b2 ) and (d1 , d2 ), or in (ii) pairs (c1 , c2 ) and (d1 , d2 ) (since
a : b :: c : d is stable under central permutation) [1].
Comparison of Analogy-Based Methods for Predicting Preferences 341

Following [11], the above pattern corresponds to a vertical reading, while


another pattern corresponding to the horizontal reading can be stated as follows:
∀i ∈ [[1, n]],
ai : bi :: ci : di
a ≼ b,
− − − − −−
c ≼ d.
The intuition behind the second pattern is simple: since a differs from b as
c differs from d (and vice-versa), and b is preferred to a, d should be preferred
to c as well. The first pattern, which involves more items and more preferences,
states that since the pair of items (d1 , d2 ) makes an analogical proportion with
the three other pairs (a1 , a2 ), (b1 , b2 ), (c1 , c2 ), the preference relation that
holds for the first 3 pairs should hold as well for the fourth one.
Besides, the structure of the first pattern follows the axiomatics of additive
utility functions, for which contradictory trade-offs are forbidden, namely: ∀i, j, if

a1 −i α ≼ a2 −i β
a1 −i γ ≽ a2 −i δ
c1 −j α ≽ c2 −j β
one cannot have:
c1 −j γ ≺ c2 −j δ
where x−i denotes the (n−1)-dimensional vector made of the evaluations of x on all
criteria except the ith one, for which the Greek letter denotes the substituted value.
This property ensures that the differences of preference between α and β, on the
one hand, and between γ and δ, on the other hand, can consistently be compared.
Thus, when applying the first pattern, one may also make sure that no con-
tradictory trade-offs are introduced by the prediction mechanism. In the first
pattern, analogical reasoning amounts here to finding triples of pairs of com-
pared items (a, b, c) appropriate for inferring the missing value(s) in d. When
there exist several suitable triples, possibly leading to different conclusions, one
may use a majority vote for concluding.
Analogical proportions can be extended to numerical values, once the values
are renormalized on scale [0, 1], by a multiple-valued logic expression. The main
option, which agrees with the Boolean case, and where truth is a matter of degree
[6] is:
A(a, b, c, d) = 1 − |(a − b) − (c − d)| if a ≥ b and c ≥ d, or a ≤ b and c ≤ d
= 1 − max(|a − b|, |c − d|) otherwise.
Note that A(a, b, c, d) = 1 iff a − b = c − d.
We can then compute to what extent an analogical proportion holds between
vectors:
A(a, b, c, d) = (1/n) Σi=1,n A(ai , bi , ci , di )                (1)
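The graded definition and Formula (1) can be transcribed directly. The sketch below (our naming) assumes attribute values already rescaled to [0, 1]:

```python
def analogy_degree(a, b, c, d):
    """Degree to which a : b :: c : d holds for scalar values in [0, 1]."""
    if (a >= b and c >= d) or (a <= b and c <= d):
        # both pairs change in the same direction: compare the changes
        return 1 - abs((a - b) - (c - d))
    # opposite directions: penalize by the largest change
    return 1 - max(abs(a - b), abs(c - d))

def analogy_degree_vec(a, b, c, d):
    """Average of the component degrees, as in Formula (1)."""
    return sum(analogy_degree(ai, bi, ci, di)
               for ai, bi, ci, di in zip(a, b, c, d)) / len(a)
```

As stated in the text, `analogy_degree(a, b, c, d)` equals 1 exactly when a − b = c − d.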
Lastly, let us remark that the second, simpler, pattern agrees with the view
that preferences w.r.t. each criterion are represented by differences of evaluations.
This includes the weighted sum, namely b ≽ a iff Σi=1,n wi (bi − ai ) ≥ 0, while

analogy holds at degree 1 iff ∀i ∈ [[1, n]], bi − ai = di − ci . This pattern does


not agree with more general models of additive utility functions, while the first
pattern is compatible with more general preference models.

3 Analogy-Based Preference Learning


In order to study the ability of analogical proportions to predict new preference
relations from a given set of such relations, while avoiding the generation of con-
tradictory trade-offs, we propose different “Analogy-based Preference Learning”
algorithms (AP L algorithms for short). The criteria are assumed to be evalu-
ated on a scale S = {1, 2, ..., k}. Let E = {ej : xj ≽ y j } be a set of preference
examples, where ≽ is a preference relation telling us that choice xj is preferred
to choice y j .

3.1 Methodology
Given a new pair of items d = (d1 , d2 ) for which preference is to be predicted, we
present two types of algorithms for predicting preferences in the following, corre-
sponding respectively to the “vertical reading” (first pattern) that exploits triples
of pairs of items, and to “horizontal reading” (second pattern) where pairs of items
are taken one by one. This leads to algorithms AP L3 and AP L1 respectively.
APL3 : The basic principle of APL3 is to find triples t = (a, b, c) of examples in E 3
that form with d either the non-contradictory trade-offs pattern (considered
first) or the analogical proportion-based inference pattern.
For each triple t, we compute an analogical score At (a, b, c, d) that
estimates the extent to which it is in analogy with the item d, using Formula 1.
Then, to guess the final preference for d, we first accumulate, for each possible
solution, the atomic scores provided by the triples in favor of this solution, and
finally assign to d the solution with the highest score. In case of ties, a majority
vote is applied.
APL3 can be described by this basic process:

– For a given d whose preference is to be predicted:
– Search for solvable triples t ∈ E 3 that make valid the analogical proportion
linking the 4 preference relations of the triple elements with d (the preference
relations between the 4 items satisfy one of the vertical patterns given in Sect. 2).
– For each triple t, compute the analogical score At (a, b, c, d).
– Compute the sum of these scores for each possible solution for d and assign to
d the solution with the highest score.

APL1 : Applying the “horizontal reading” (second pattern), we consider only one
item a at a time and compare it with d in terms of pairs of vectors, rather
than comparing 4 preferences simultaneously, as with the first pattern. From a
preference a : a1 ≽ a2 such that (a1 , a2 , d1 , d2 ) is in analogical proportion, one
extrapolates that the same preference still holds for d : d1 ≽ d2 . A similar process,
which the authors call analogical transfer of preferences, is applied in [7]. A comparison

of AP L1 with the algorithm they recently proposed is presented in the subsection


after the next one.
Following this logic, for each item a in the training set, an analogical score
A(a1 , a2 , d1 , d2 ) is computed. As in the case of the vertical reading, these atomic
scores are accumulated for each possible solution for d (induced from the items a).
Finally, the solution with the highest score is assigned to d.
APL1 can be described by this basic process:

– For a given d : (d1 , d2 ) whose preference is to be predicted:
– For each item a ∈ E, compute the analogical score A(a1 , a2 , d1 , d2 ).
– Compute the sum of these scores for each possible solution for d and assign to
d the solution with the highest score.
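The steps above can be sketched in a few lines of Python. Here `examples` is a hypothetical list of pairs (a1, a2) with a1 known to be preferred to a2, attribute values are assumed rescaled to [0, 1], and all names are ours:

```python
def analogy_degree(a, b, c, d):
    """Graded analogical proportion between scalars in [0, 1]."""
    if (a >= b and c >= d) or (a <= b and c <= d):
        return 1 - abs((a - b) - (c - d))
    return 1 - max(abs(a - b), abs(c - d))

def analogy_degree_vec(a, b, c, d):
    """Component average, as in Formula (1)."""
    return sum(map(analogy_degree, a, b, c, d)) / len(a)

def apl1_predict(examples, d1, d2):
    """APL1 (horizontal reading): accumulate analogical scores over the
    training pairs; return '>' if d1 is predicted preferred to d2,
    '<' otherwise."""
    scores = {'>': 0.0, '<': 0.0}
    for a1, a2 in examples:
        keep = analogy_degree_vec(a1, a2, d1, d2)  # same orientation as a
        swap = analogy_degree_vec(a2, a1, d1, d2)  # reversed orientation
        if keep > swap:
            scores['>'] += keep   # transfer the preference of a as-is
        else:
            scores['<'] += swap   # transfer the reversed preference
    return max(scores, key=scores.get)
```

This mirrors the accumulation of Algorithm 2 below, without the tie-breaking majority vote.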

3.2 Algorithms

Based on the above ideas, we propose two different algorithms for predicting
preferences. Let E be a training set of examples whose preference is known.
Algorithms 1 and 2 respectively describe the two previously introduced

Algorithm 1. APL3
Input: a training set E of examples with known preferences,
       a new item d ∉ E whose preference P (d) is unknown.
SumA(p) = 0 for each p ∈ {≽, ≼}
BestAt = 0, S = ∅, BestSol = ∅
for each triple t = (a, b, c) in E 3 do
    S = FindCandidateTriples(t)
    for each candidate triple ct = (a′ , b′ , c′ ) in S do
        if (P (a′ ) : P (b′ ) :: P (c′ ) : x has solution p) then
            At = min(A(a′1 , b′1 , c′1 , d1 ), A(a′2 , b′2 , c′2 , d2 ))
            if (At > BestAt ) then
                BestAt = At
                BestSol = Sol(ct)
            end if
        end if
    end for
    SumA(BestSol) += BestAt
end for
maxi = max{SumA}
if (maxi ≠ 0) then
    if (unique(maxi, SumA)) then
        P (d) = argmaxp {SumA}
    else
        Majority vote
    end if
else
    No Prediction
end if
return P (d)

procedures APL3 and APL1 . Note that in Algorithm 1, to evaluate the analogical
score At (a, b, c, d) for each triple t, we choose to consider all the possible arrange-
ments of items a, b and c, i.e., for each item x, both x : x1 ≽ x2 and x : x2 ≽ x1
are evaluated. The function FindCandidateTriples(t) finds such candidate
triples. Since we are dealing with triples in this algorithm, 2³ = 8 candidate
triples are evaluated for each triple t. The final score for t is the best score among
its candidate triples. In both Algorithms 1 and 2, P (x) returns the sign of the
preference relation for x. For APL3 , we also
consider another alternative in order to drastically reduce the number of triples
to be investigated. This alternative follows exactly the same process described
by Algorithm 1, with one difference: instead of systematically surveying E 3 , we
restrict the search for solvable triples t = (a, b, c) by constraining c to be one of the
k-nearest neighbors of d w.r.t. the Manhattan distance (k is a parameter to be tuned).
This option decreases the complexity of APL3 , which becomes quadratic instead
of cubic. A similar approach [3] showed good efficiency for classifying
nominal or numerical data. We denote by APL3 (NN) the algorithm corresponding
to this alternative.
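This neighbor restriction can be sketched as a simple pre-filter (names and data layout are ours): training pairs are ranked by the Manhattan distance between their concatenated attribute vectors and those of the new pair d, and only the k closest are allowed as the third element c of a triple.

```python
def manhattan(u, v):
    """L1 distance between two attribute vectors."""
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

def knn_pairs(pairs, d1, d2, k):
    """Return the k training pairs (c1, c2) closest to (d1, d2) w.r.t.
    the Manhattan distance on the concatenation of the two items."""
    target = list(d1) + list(d2)
    return sorted(pairs,
                  key=lambda p: manhattan(list(p[0]) + list(p[1]),
                                          target))[:k]
```

Filtering once before the triple enumeration is what brings the search from cubic down to quadratic in |E|.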

Algorithm 2. APL1
Input: a training set E of examples with known preferences,
       a new item d ∉ E whose preference P (d) is unknown.
SumA(p) = 0 for each p ∈ {≽, ≼}
BestA = 0, BestSol = ∅
for each a in E do
    BestA = max(A(a1 , a2 , d1 , d2 ), A(a2 , a1 , d1 , d2 ))
    if (A(a1 , a2 , d1 , d2 ) > A(a2 , a1 , d1 , d2 )) then
        BestSol = P (a)
    else
        BestSol = notP (a)
    end if
    SumA(BestSol) += BestA
end for
maxi = max{SumA}
if (maxi ≠ 0) then
    if (unique(maxi, SumA)) then
        P (d) = argmaxp {SumA}
    else
        Majority vote
    end if
else
    No Prediction
end if
return P (d)

3.3 Related Work and Complexity

A variety of research works deal with preference learning
problems, see, e.g., [8]. The goal of most of these works is to predict a total order
function that agrees with a given preference relation. See, e.g., Cohen et al. [5],
who developed a greedy ordering algorithm to build a total order on the input
preferences given by an expert, and also suggest an approach to linearly combine
a set of preference functions. However, in this paper we only intend to predict
preference relations, not a total order on preferences.
Even if the proposed approaches for predicting preferences may look sim-
ilar to recommender system models, the latter address problems that are
somewhat different from ours. Observe that, in general, prediction in recommender
systems is based on examples where items are associated with absolute grades
(e.g., from 1 to 5); namely, examples are not made of comparisons between pairs
of items as in our case.
To the best of our knowledge, the only approach also aiming at predicting pref-
erences on an analogical proportion basis is the recent paper [7], which only inves-
tigates “the horizontal reading” of preference relations, leaving aside a prelimi-
nary version of “the vertical reading” [1]. The algorithms proposed in [1] only use
the Boolean setting of analogical proportions, while the approach presented here
deals with the multiple-valued setting (also applied in [7]). Moreover, Algorithm 2
in [1] used a set of preference examples completed by monotony in case no triples
satisfying analogical proportion could be found, while in this work, this issue is
simply solved by selecting triples satisfying the analogical proportions to some
degree, as suggested in the presentation of the multiple-valued extension in Sect. 2.
This is why we compare our proposed analogy-based algorithms in depth with
this last work [7]. APL algorithms differ from this approach in two ways.
First, the focus of [7] is on learning to rank user preferences based on the eval-
uation of a loss function, while our focus is on predicting preferences (rather than
getting a ranking) evaluated with the error rate of predictions.
Second, although Algorithm 1 in [7] may be useful for predicting preferences in
the same way as our AP L1 , the key difference between these two algorithms is that
AP L1 exploits the summation of all valid analogical proportions for each possible
solution for d to be predicted, while Algorithm 1 in [7] computes all valid analogi-
cal proportions, and then considers only the N most relevant ones for prediction,
i.e., those having the largest analogical scores. To select the N best scores, a total
order on valid proportions is required for each item d to be predicted, which may
be computationally costly for a large number of items, as noted in [7], while no
ordering of analogical proportions is required in the proposed APL1 .
To compare our AP L algorithms to Algorithm 1 in [7], suitable for predicting
preferences, we also re-implemented the latter as described in their paper (without
considering their Algorithm 2). We also tuned the parameter N with the same
input values as fixed by the authors [7].
In terms of complexity, due to the use of triples of pairs of items in AP L3 , the
algorithm has a cubic complexity while Algorithm AP L3 (N N ) is quadratic. Both
AP L1 and Algorithm 1 in [7] are linear, even if Algorithm 1 is slightly computa-
tionally more costly due to the ordering process.

4 Experimentations
To evaluate the proposed APL algorithms, we have developed a set of experiments
that we describe in the following.

4.1 Datasets
The experimental study is based on five datasets; the first two are synthetic
datasets generated from different functions: weighted average, Tversky’s additive
difference, and Sugeno integral, described in the following. For each dataset, any
possible combination of the feature values over the scale S is associated with a
preference relation.

– Datasets 1: we consider only 3 criteria in each preference relation, i.e., n = 3.
We generate different types of datasets:
1. Examples in this dataset are first generated using a weighted average func-
tion (denoted WA in Table 2) with weights 0.6, 0.3, 0.1, respectively for cri-
teria 1, 2 and 3.
2. The second artificial dataset (denoted TV in Table 2) is generated using
Tversky’s additive difference model [15], i.e., an alternative a is preferred
over b if Σi=1,n Φi (ai − bi ) ≥ 0, where the Φi are increasing and odd real-
valued functions. For generating this dataset, we used the piecewise linear
functions given in Appendix A.
3. Then, we generate datasets using the weighted maximum and weighted mini-
mum, which are particular cases of Sugeno integrals, namely the aggrega-
tion functions defined as follows:
SMax = maxi=1..n min(vi , wi ),
SMin = mini=1..n max(vi , 6 − wi ),
where vi refers to the value of criterion i and wi represents its weight. In this
case, we tried two different sets of weights: w1 = (5, 4, 2) and w2 = (5, 3, 3),
respectively for criteria 1, 2 and 3.
– Datasets 2: we expand each preference relation to support 5 criteria, i.e.,
n = 5. We apply the weights 0.4, 0.3, 0.1, 0.1, 0.1 in the case of the weighted
average function, and w1 = (5, 4, 3, 2, 1) and w2 = (5, 4, 4, 2, 2) in the case of the
Sugeno integral functions. For generating the second dataset (TV), we used the
piecewise linear functions given in Appendix A. For the two datasets, weights are
fixed on an empirical basis, although other choices have been tested and have
led to similar results.
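To make the generation procedure concrete, here is a sketch (function names are ours; parameter values are those given in the text for 3 criteria) that samples labeled preference examples from the weighted average and the weighted maximum (SMax) models:

```python
import itertools
import random

def weighted_avg(x, w=(0.6, 0.3, 0.1)):
    """Weighted average with the paper's weights for 3 criteria."""
    return sum(wi * xi for wi, xi in zip(w, x))

def smax(x, w=(5, 4, 2)):
    """Weighted maximum, a particular case of the Sugeno integral."""
    return max(min(xi, wi) for xi, wi in zip(x, w))

def make_examples(score, n_examples, n_crit=3, scale=range(1, 6), seed=0):
    """Sample (x, y, pref) triples where pref is True iff score(x) > score(y);
    ties are discarded, so every example is a strict preference."""
    rng = random.Random(seed)
    items = list(itertools.product(scale, repeat=n_crit))
    out = []
    while len(out) < n_examples:
        x, y = rng.choice(items), rng.choice(items)
        sx, sy = score(x), score(y)
        if sx != sy:
            out.append((x, y, sx > sy))
    return out
```

Discarding ties is our assumption; the paper does not say how equal-scoring pairs are handled.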

We limit ourselves to 5 criteria since it is already a rather high number of cri-


teria for the cognitive appraisal of an item by a human user in practice. For both
datasets, each criterion is evaluated on a scale with 5 levels, i.e., S = {1, ..., 5}.

To check the applicability of APL algorithms, it is important to measure their


efficiency on real data. For our experiments, data should be collected as pairs of
choices/options among which a human is supposed to pick one. To the best
of our knowledge, no such dataset is available in this format [4].
Note that in the following datasets, a user only provides an overall rating score for
each choice. Instead, we first pre-process these datasets to generate the preferences
into the needed format. For any two inputs with different ratings, we generate a
preference relation. We use the three following datasets:

– The Food dataset (https://github.com/trungngv/gpfm) contains 4036 user


preferences among 20 food menus picked by 212 users. Features represent 3
levels of user hunger; the study is restricted to 5 different foods.
– The University dataset (www.cwur.org) includes the top 100 universities
from the world for 2017 with 9 numerical features such as national rank, quality
of education, etc.
– The Movie-Lens dataset (https://grouplens.org) includes users’ responses to
a survey on how serendipitous a particular movie was to them. It contains 2150
user preferences among different movies picked by different users.

Table 1 gives a summary of the datasets characteristics. In order to apply the


multiple-valued definition of analogy, all numerical attributes are rescaled. Each
numerical feature x is replaced by (x − xmin )/(xmax − xmin ), where xmin and xmax
respectively represent the minimal and the maximal values for this feature,
computed using the training set only.
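This rescaling can be sketched as follows (helper names are ours). As in the paper, the bounds come from the training set only; clipping test values that fall outside the training range is an extra assumption of ours, not stated in the text.

```python
def minmax_fit(train_values):
    """Bounds learned on the training set only."""
    return min(train_values), max(train_values)

def minmax_apply(x, lo, hi):
    """Map x to [0, 1] using training bounds; out-of-range test values
    are clipped (our assumption, not stated in the paper)."""
    if hi == lo:
        return 0.0  # constant feature: carries no information
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))
```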

4.2 Validation Protocol

In terms of protocol, we apply a standard 10-fold cross-validation technique. To


tune the parameter k of AP L3 (N N ) as well as parameter N for Algorithm 1 in
[7], for each fold, we only keep the corresponding training set and we perform
again a 5-fold cross-validation with diverse values of the parameters. We consider
k ∈ {10, 15, ..., 30} and we keep the same optimization values for N fixed by the
authors for a fair comparison with [7], i.e., N ∈ {10, 15, 20}. We then select the
parameter values providing the best accuracy. These tuned parameters are then

Table 1. Datasets description

Dataset Features Ordinal Binary Numeric Instances


Dataset1 3 3 – – 200
Dataset2 5 5 – – 200
Food 4 4 – – 200
University 9 – – 9 200
Movie-lens 17 9 8 – 200

used to perform the initial cross-validation. We run each algorithm (with the pre-
vious procedure) 10 times. Accuracies and parameters shown in Table 2 are the
averages over the 10 runs.

4.3 Results
Tables 2 and 3 provide prediction accuracies respectively for synthetic and real
datasets for the three proposed AP L algorithms as well as Algorithm 1 described
in [7] (denoted here “FH18”). The best accuracies for each dataset size are high-
lighted in bold.
If we analyze results in Tables 2 and 3 we can conclude that:
– For synthetic data, in the case of datasets generated from a weighted average, it
is clear that APL3 achieves the best performance for almost all dataset sizes;
APL1 comes just after. Note that these two algorithms accumulate the analogical
scores of all triples/items for prediction. This suggests that using the whole
training set for prediction suits examples generated by a weighted average.
– In the case of datasets generated from a Sugeno integral, APL1 is significantly
better than the other algorithms for most dataset sizes and for the two sets of
weights W1 and W2 .
– If we compare the results for the three types of datasets (generated from a
weighted average, from Tversky’s additive difference, or from a Sugeno integral),
globally, we can see that the accuracy obtained on the Sugeno integral datasets
is the best in the case of datasets with 3 criteria (see for example APL1). For
datasets with 5 criteria, results obtained on the weighted average datasets are
better than on the two others, while results obtained on the Tversky’s additive
difference datasets seem less accurate in most cases.
– For real datasets, it appears that APL3 (NN) is the best predictor for most
tested datasets. To predict user preferences, rather than using the whole training
set, it is better to select a subset of training examples, namely those forming
triples in which one element is among the k-nearest neighbors of the new pair.
– AP L3 (N N ) seems less efficient in case of synthetic datasets. This is due to the
fact that synthetic data is generated randomly and applying the NN-approach
is less suitable in such cases.
– If we compare the APL algorithms to Algorithm 1 “FH18”, we can see that APL3
outperforms the latter in the case of synthetic datasets. Moreover, APL3 (NN) is
better than “FH18” in the case of real datasets.
– The APL algorithms achieve the same accuracy as Algorithm 2 in [1] with a much
smaller dataset (for the dataset with 5 criteria built from a weighted average,
only 200 examples are used by the APL algorithms instead of 1000 examples in [1]
to achieve the best accuracy). The two algorithms have close results on the
Food dataset.
– Comparing vertical and horizontal approaches, there is no clear superiority
of one view (for the tested datasets).
To further investigate the difference between the two best algorithms identified in
the previous study, we also perform a pairwise comparison at the instance level

Table 2. Prediction accuracies for Dataset 1 and Dataset 2

Data Size APL3 APL3(NN) k∗ APL1 FH18 N∗


D1 50 WA 92.4 ± 11.94 89.0 ± 12.72 22 92.9 ± 11.03 90.6 ± 12.61 14
TV 87.8 ± 12.86 88.6 ± 12.71 19 86.8 ± 14.28 90.4 ± 12.61 11
SMax-W1 90.0 ± 12.24 88.8 ± 12.01 19 90.2 ± 11.66 88.0 ± 13.02 15
SMax-W2 91.2 ± 12.91 88.4 ± 13.59 20 95.8 ± 7.92 90.2 ± 11.62 15
SMin-W1 90.0 ± 13.24 89.4 ± 11.97 17 88.6 ± 13.18 86.4 ± 13.76 16
SMin-W2 93.2 ± 10.37 89.6 ± 12.61 20 94.2 ± 9.86 92.4 ± 11.48 18
100 WA 94.75 ± 6.79 91.55 ± 8.51 24 93.85 ± 7.51 93.5 ± 7.09 15
TV 89.6 ± 9.43 86.5 ± 10.41 21 88.0 ± 9.6 92.1 ± 8.45 15
SMax-W1 93.4 ± 6.5 91.2 ± 9.31 25 93.2 ± 6.98 91.2 ± 9.14 17
SMax-W2 93.8 ± 7.17 88.8 ± 8.08 25 95.3 ± 5.66 92.9 ± 8.2 15
SMin-W1 90.4 ± 9.66 89.6 ± 8.57 19 90.6 ± 9.43 89.4 ± 8.73 16
SMin-W2 94.3 ± 6.56 90.1 ± 7.68 21 96.3 ± 5.09 94.6 ± 5.68 14
200 WA 95.25 ± 4.94 92.25 ± 6.28 23 95.4 ± 4.33 94.55 ± 5.17 13
TV 91.25 ± 5.48 91.7 ± 5.93 25 90.1 ± 5.18 95.1 ± 4.53 13
SMax-W1 90.0 ± 5.98 89.3 ± 6.73 24 90.2 ± 6.24 89.4 ± 6.38 14
SMax-W2 95.7 ± 3.28 92.6 ± 5.35 26 97.2 ± 2.71 95.6 ± 4.08 15
SMin-W1 92.0 ± 7.17 91.1 ± 5.86 24 92.0 ± 5.82 90.3 ± 6.38 18
SMin-W2 94.8 ± 4.88 91.55 ± 5.15 26 97.4 ± 3.35 95.0 ± 4.45 16
D2 50 WA 88.3 ± 12.41 86.2 ± 14.43 20 84.9 ± 15.17 83.5 ± 14.99 15
TV 89.4 ± 15.08 86.0 ± 16.46 17 89.2 ± 14.85 87.6 ± 14.44 17
SMax-W1 86.6 ± 13.77 83.8 ± 14.14 18 86.4 ± 12.18 83.2 ± 15.53 16
SMax-W2 85.8 ± 15.77 82.4 ± 15.44 20 87.0 ± 14.58 81.2 ± 15.7 14
SMin-W1 86.6 ± 14.34 86.2 ± 13.45 23 85.6 ± 14.61 84.0 ± 14.57 16
SMin-W2 88.8 ± 11.5 86.2 ± 14.29 22 89.0 ± 11.94 83.6 ± 14.78 17
100 WA 92.0 ± 7.51 89.0 ± 8.89 22 90.0 ± 8.22 88.3 ± 9.71 15
TV 90.2 ± 8.18 88.1 ± 8.93 18 91.4 ± 8.03 86.8 ± 9.06 15
SMax-W1 88.0 ± 9.52 87.9 ± 9.56 21 88.8 ± 9.31 85.4 ± 10.7 16
SMax-W2 87.7 ± 9.17 86.1 ± 9.77 24 90.1 ± 9.73 85.1 ± 9.86 17
SMin-W1 89.1 ± 9.15 88.6 ± 10.47 21 90.2 ± 9.3 85.6 ± 9.98 16
SMin-W2 87.4 ± 9.7 83.2 ± 11.99 24 89.2 ± 9.27 84.8 ± 10.06 16
200 WA 94.7 ± 5.11 90.2 ± 6.01 24 92.3 ± 5.2 90.0 ± 6.08 17
TV 91.1 ± 6.33 89.0 ± 6.54 27 91.9 ± 6.06 89.65 ± 6.81 16
SMax-W1 89.15 ± 6.76 88.65 ± 6.58 22 89.7 ± 6.84 86.5 ± 7.66 17
SMax-W2 89.4 ± 6.45 88.15 ± 6.77 27 91.7 ± 5.65 86.65 ± 6.71 16
SMin-W1 90.7 ± 5.79 89.25 ± 6.45 24 91.3 ± 5.1 88.8 ± 6.55 15
SMin-W2 89.75 ± 5.64 88.0 ± 6.33 23 91.55 ± 5.68 87.8 ± 6.68 16

Table 3. Prediction accuracies for real datasets

Dataset Size APL3 APL3(NN) k∗ APL1 FH18 N∗


Food 200 61.3 ± 8.32 63.0 ± 9.64 15 61.05 ± 9.34 57.55 ± 10.41 13
1000 – 73.16 ± 3.99 20 63.11 ± 5.0 63.11 ± 5.54 20
Univ. 200 73.6 ± 9.67 80.0 ± 8.03 14 73.6 ± 8.47 75.7 ± 8.29 12
1000 – 87.9 ± 3.04 17 76.76 ± 3.86 83.74 ± 3.26 12
Movie 200 51.9 ± 14.72 49.1 ± 15.2 19 52.93 ± 13.52 48.61 ± 14.26 15
1000 – 55.06 ± 4.51 23 54.48 ± 4.7 53.38 ± 5.32 10

between APL3 and APL1 in the case of synthetic datasets, and between APL3 (NN)
and APL1 in the case of real datasets. In this comparison, we want to check whether
the preference d is predicted in the same way by the two algorithms. For this purpose,
we compute the frequency of cases where the two algorithms both predict the correct
preference for d (noted TT), the frequency of cases where both algorithms predict
an incorrect preference (noted FF), the frequency where the APL3 (or APL3 (NN))
prediction is correct and the APL1 prediction is wrong (TF), and the frequency
where the APL3 (or APL3 (NN)) prediction is wrong and the APL1 prediction is
correct (FT). In this new experiment, we only consider the datasets generated from
a weighted average in the case of synthetic data. Results are shown in Table 4.
Results in this table show that the two compared algorithms predict preferences
in the same way in most cases (the highest frequency appears in column TT). If
we compare columns TF and FT, it is clear that APL3 (or APL3 (NN)) is
significantly better than APL1 , since the frequency of cases where APL3 (or
APL3 (NN)) yields the correct prediction while APL1 does not (column TF) is
higher than that of the opposite case (column FT). These results confirm our
previous study.
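The four frequencies can be computed with a small helper (names are ours), given the ground-truth preferences and the predictions of the two algorithms:

```python
def pairwise_frequencies(truth, preds_a, preds_b):
    """Frequencies of TT / TF / FT / FF outcomes, where the first letter
    says whether algorithm A is correct and the second whether B is."""
    counts = {'TT': 0, 'TF': 0, 'FT': 0, 'FF': 0}
    for t, pa, pb in zip(truth, preds_a, preds_b):
        key = ('T' if pa == t else 'F') + ('T' if pb == t else 'F')
        counts[key] += 1
    n = len(truth)
    return {k: v / n for k, v in counts.items()}
```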

Table 4. Frequencies of preferences d correctly/incorrectly predicted, in the same
way or not, by the two compared algorithms

Datasets Size TT TF FT FF
D1 100 0.96 0.02 0.01 0.01
200 0.93 0.02 0.02 0.03
D2 100 0.84 0.07 0.05 0.04
200 0.9 0.04 0.01 0.05
Food 1000 0.528 0.18 0.106 0.186
Univ. 1000 0.735 0.158 0.033 0.074
Movie 1000 0.364 0.189 0.17 0.277

In Tables 5 and 6, we compare the best results obtained with our algorithms to
the accuracies obtained by finding the weighted sum that best fits the data in the

learning sets. The weights are found by using linear programming as explained in
Appendix B.
Table 5 displays the results obtained using the synthetic datasets generated
according to Tversky’s model. The results for datasets generated by a weighted
average or a Sugeno integral are not reproduced in this table because the weighted
sum (WSUM) almost always reaches an accuracy of 100%. In only three cases out
of thirty is its accuracy slightly lower, with a worst performance of 97.5%. This is
not unexpected for datasets generated by means of a weighted average, since WSUM
is the right model in this case. It is more surprising for data generated by a Sugeno
integral (even if we have only dealt here with particular cases), but we get here
some empirical evidence that the Sugeno integral can be well approximated by a
weighted sum. The results are quite different for datasets generated by Tversky’s
model: WSUM shows the best accuracy in two cases; APL1 and APL3 in two
cases each. Tversky’s model does not lead to transitive preference relations in
general, and this may be detrimental to WSUM, which models transitive relations.

Table 5. Prediction accuracies for artificial datasets generated by the Tversky model

Dataset Size APL3 APL1 WSUM


TV (3 features) 50 87.8 ± 12.86 86.8 ± 14.28 82.00 ± 17.41
100 89.6 ± 9.43 88.0 ± 9.6 93.00 ± 8.23
200 81.25 ± 5.48 90.1 ± 5.18 91.00 ± 6.43
TV (5 features) 50 89.4 ± 15.08 89.2 ± 14.85 84.00 ± 15.78
100 90.2 ± 8.18 91.4 ± 8.03 87.00 ± 9.49
200 91.1 ± 6.33 91.9 ± 6.06 85.50 ± 6.53

Table 6 compares the accuracies obtained with the real datasets. WSUM yields
the best results for all datasets except for the “Food” dataset, size 1000.

Table 6. Prediction accuracies for real datasets

Dataset Size APL3(NN) APL1 WSUM


Food 200 63.0 ± 9.64 61.05 ± 9.34 64.00 ± 20.11
1000 73.16 ± 3.99 63.11 ± 5.0 61.10 ± 10.19
Univ. 200 80.0 ± 8.03 73.6 ± 8.47 99.50 ± 1.58
1000 87.9 ± 3.04 76.76 ± 3.86 88.70 ± 21.43
Movie 200 49.1 ± 15.2 52.93 ± 13.52 69.50 ± 18.77
1000 55.06 ± 4.51 54.48 ± 4.7 77.60 ± 16.93

These examples suggest that analogy-based algorithms may surpass WSUM
in some cases. However, the type of datasets for which this happens is still to be
determined.

5 Conclusion
The results presented in the previous section confirm the interest of considering
analogical proportions for predicting preferences, which was the primary goal of
this paper, since such an approach has been proposed only recently. We observed
that analogical proportions yield better accuracy than a weighted sum model for
certain datasets (TV among the synthetic datasets, and Food as a real dataset).
Determining for which datasets this tends to be the case requires further
investigation.
Analogical proportions may be a tool of interest for creating artificial examples
that are useful for enlarging a training set, see, e.g., [2]. It would be worth investi-
gating whether such enlarged datasets could benefit analogy-based preference
learning algorithms as well as those based on a weighted sum.

Acknowledgements. This work was partially supported by ANR-11-LABX-0040-


CIMI (Centre Inter. de Math. et d’Informatique) within the program ANR-11-IDEX-
0002-02, project ISIPA.

A Tversky’s Additive Difference Model

Tversky’s additive difference model functions used in the experiments are given
below. Let d1 , d2 be a pair of alternatives to be compared. We denote by
ηi the difference between d1 and d2 on criterion i, i.e., ηi = d1i − d2i . For the
TV dataset in which 3 features are involved, we used the following piecewise linear
functions:

\[
\Phi_1(\eta_1)=\operatorname{sgn}(\eta_1)\;0.453\cdot
\begin{cases}
0.143\,|\eta_1| & \text{if } |\eta_1|\in[0,0.25],\\
-0.168+0.815\,|\eta_1| & \text{if } |\eta_1|\in[0.25,0.5],\\
0.230+0.018\,|\eta_1| & \text{if } |\eta_1|\in[0.5,0.75],\\
-2.024+3.024\,|\eta_1| & \text{if } |\eta_1|\in[0.75,1],
\end{cases}
\]

\[
\Phi_2(\eta_2)=\operatorname{sgn}(\eta_2)\;0.053\cdot
\begin{cases}
2.648\,|\eta_2| & \text{if } |\eta_2|\in[0,0.25],\\
0.371+1.163\,|\eta_2| & \text{if } |\eta_2|\in[0.25,0.5],\\
0.926+0.054\,|\eta_2| & \text{if } |\eta_2|\in[0.5,0.75],\\
0.866+0.134\,|\eta_2| & \text{if } |\eta_2|\in[0.75,1],
\end{cases}
\]

\[
\Phi_3(\eta_3)=\operatorname{sgn}(\eta_3)\;0.494\cdot
\begin{cases}
0.289\,|\eta_3| & \text{if } |\eta_3|\in[0,0.25],\\
-0.197+1.076\,|\eta_3| & \text{if } |\eta_3|\in[0.25,0.5],\\
0.150+0.383\,|\eta_3| & \text{if } |\eta_3|\in[0.5,0.75],\\
-1.252+2.252\,|\eta_3| & \text{if } |\eta_3|\in[0.75,1].
\end{cases}
\]
Comparison of Analogy-Based Methods for Predicting Preferences 353

For the TV dataset in which 5 features are involved, we used the following
piecewise linear functions:


\[
\Phi_1(\eta_1)=\operatorname{sgn}(\eta_1)\;0.294\cdot
\begin{cases}
2.510\,|\eta_1| & \text{if } |\eta_1|\in[0,0.25],\\
0.562+0.263\,|\eta_1| & \text{if } |\eta_1|\in[0.25,0.5],\\
0.645+0.096\,|\eta_1| & \text{if } |\eta_1|\in[0.5,0.75],\\
-0.130+1.130\,|\eta_1| & \text{if } |\eta_1|\in[0.75,1],
\end{cases}
\]

\[
\Phi_2(\eta_2)=\operatorname{sgn}(\eta_2)\;0.151\cdot
\begin{cases}
0.125\,|\eta_2| & \text{if } |\eta_2|\in[0,0.25],\\
0.025+0.023\,|\eta_2| & \text{if } |\eta_2|\in[0.25,0.5],\\
-0.545+1.164\,|\eta_2| & \text{if } |\eta_2|\in[0.5,0.75],\\
-1.689+2.689\,|\eta_2| & \text{if } |\eta_2|\in[0.75,1],
\end{cases}
\]

\[
\Phi_3(\eta_3)=\operatorname{sgn}(\eta_3)\;0.039\cdot
\begin{cases}
2.388\,|\eta_3| & \text{if } |\eta_3|\in[0,0.25],\\
0.582+0.057\,|\eta_3| & \text{if } |\eta_3|\in[0.25,0.5],\\
-0.046+1.314\,|\eta_3| & \text{if } |\eta_3|\in[0.5,0.75],\\
0.759+0.241\,|\eta_3| & \text{if } |\eta_3|\in[0.75,1],
\end{cases}
\]

\[
\Phi_4(\eta_4)=\operatorname{sgn}(\eta_4)\;0.425\cdot
\begin{cases}
0.014\,|\eta_4| & \text{if } |\eta_4|\in[0,0.25],\\
-0.110+0.455\,|\eta_4| & \text{if } |\eta_4|\in[0.25,0.5],\\
-0.341+0.917\,|\eta_4| & \text{if } |\eta_4|\in[0.5,0.75],\\
-1.613+2.613\,|\eta_4| & \text{if } |\eta_4|\in[0.75,1],
\end{cases}
\]

\[
\Phi_5(\eta_5)=\operatorname{sgn}(\eta_5)\;0.091\cdot
\begin{cases}
3.307\,|\eta_5| & \text{if } |\eta_5|\in[0,0.25],\\
0.697+0.519\,|\eta_5| & \text{if } |\eta_5|\in[0.25,0.5],\\
0.880+0.153\,|\eta_5| & \text{if } |\eta_5|\in[0.5,0.75],\\
0.979+0.021\,|\eta_5| & \text{if } |\eta_5|\in[0.75,1].
\end{cases}
\]

B Linear Program Used for Computing a Weighted Sum


We compared the performances of the algorithms presented in this paper with the results obtained by a linear program inferring the parameters of a weighted sum that fits the learning set as well as possible. The linear program is given below:

\[
\begin{aligned}
\min\ & \sum_{a\in E}\delta_a\\
\text{s.t.}\ & \sum_{i=1}^{n} w_i\,(a^1_i-a^2_i)+\delta_a \ge 0 && \forall a\in E:\ a^1\succ a^2\\
& \sum_{i=1}^{n} w_i\,(a^1_i-a^2_i)-\delta_a \le -\varepsilon && \forall a\in E:\ a^1\prec a^2\\
& w_i\in[0,1],\quad i=1,\dots,n\\
& \delta_a\in[0,\infty[
\end{aligned}
\]
with:
– n: number of features,
– E: learning set composed of pairs (a¹, a²) evaluated on the n features, together with a preference relation for each pair (a¹ ≻ a² or a¹ ≺ a²),
– w_i: weight associated with feature i,
– ε: a small positive value.
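For a fixed weight vector, the minimal slack δ_a of each constraint above can be computed in closed form, which gives a cheap way to score how well a candidate weighted sum fits a learning set. The sketch below follows that idea; the toy data in the usage test and the ε value are illustrative, not taken from the paper:

```python
def total_slack(weights, pairs, eps=1e-3):
    """Sum of the minimal slacks delta_a of the linear program above.

    `pairs` is a list of (a1, a2, pref) where a1, a2 are feature vectors
    and pref is '>' if a1 is preferred to a2, '<' otherwise.  For fixed
    weights, the smallest feasible delta_a of each constraint is direct.
    """
    total = 0.0
    for a1, a2, pref in pairs:
        diff = sum(w * (x1 - x2) for w, x1, x2 in zip(weights, a1, a2))
        if pref == '>':
            # need diff + delta >= 0  ->  delta = max(0, -diff)
            total += max(0.0, -diff)
        else:
            # need diff - delta <= -eps  ->  delta = max(0, diff + eps)
            total += max(0.0, diff + eps)
    return total
```

A weight vector that reproduces every preference yields a total slack of zero; the LP searches for the weights minimizing this quantity.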

References
1. Bounhas, M., Pirlot, M., Prade, H.: Predicting preferences by means of analogi-
cal proportions. In: Cox, M.T., Funk, P., Begum, S. (eds.) ICCBR 2018. LNCS
(LNAI), vol. 11156, pp. 515–531. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01081-2_34
2. Bounhas, M., Prade, H.: An analogical interpolation method for enlarging a training
dataset. In: Ben Amor, N., Theobald, M. (eds.) Proceedings of the 13th International
Conference on Scalable Uncertainty Management (SUM 2019), Compiègne, 16–18
December. LNCS, Springer (2019)
3. Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or
numerical data. Int. J. Approx. Reason. 91, 36–55 (2017)
4. Chen, S., Joachims, T.: Predicting matchups and preferences in context. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), pp. 775–784. ACM (2016)
5. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. CoRR
abs/1105.5464 (2011). http://arxiv.org/abs/1105.5464
6. Dubois, D., Prade, H., Richard, G.: Multiple-valued extensions of analogical pro-
portions. Fuzzy Sets Syst. 292, 193–202 (2016)
7. Fahandar, M.A., Hüllermeier, E.: Learning to rank based on analogical reasoning.
In: Proceedings of 32nd National Conference on Artificial Intelligence (AAAI 2018),
New Orleans, 2–7 February 2018 (2018)
8. Fürnkranz, J., Hüllermeier, E. (eds.): Preference Learning. Springer, Heidelberg
(2010). https://doi.org/10.1007/978-3-642-14125-6
9. Hüllermeier, E., Fürnkranz, J.: Editorial: preference learning and ranking. Mach.
Learn. 93(2–3), 185–189 (2013)
10. Miclet, L., Bayoudh, S., Delhay, A.: Analogical dissimilarity: definition, algorithms
and two experiments in machine learning. JAIR 32, 793–824 (2008)
11. Pirlot, M., Prade, H., Richard, G.: Completing preferences by means of analogi-
cal proportions. In: Torra, V., Narukawa, Y., Navarro-Arribas, G., Yañez, C. (eds.)
MDAI 2016. LNCS (LNAI), vol. 9880, pp. 135–147. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45656-0_12
12. Prade, H., Richard, G.: Homogeneous logical proportions: their uniqueness and their
role in similarity-based prediction. In: Brewka, G., Eiter, T., McIlraith, S.A. (eds.)
Proceedings of 13th International Conference on Principles of Knowledge Repre-
sentation and Reasoning (KR 2012), Roma, 10–14 June, pp. 402–412. AAAI Press
(2012)
13. Prade, H., Richard, G.: From analogical proportion to logical proportions. Log.
Univers. 7(4), 441–505 (2013)
14. Prade, H., Richard, G.: Analogical proportions: from equality to inequality. Int. J.
Approx. Reason. 101, 234–254 (2018)
15. Tversky, A.: Intransitivity of preferences. Psychol. Rev. 76, 31–48 (1969)
Using Convolutional Neural Network
in Cross-Domain Argumentation
Mining Framework

Rihab Bouslama1(B) , Raouia Ayachi2 , and Nahla Ben Amor1


1
LARODEC, ISG, Université de Tunis, Tunis, Tunisia
[email protected], [email protected]
2
ESSECT, LARODEC, Université de Tunis, Tunis, Tunisia
[email protected]

Abstract. Argument Mining has become a remarkable research area in the computational argumentation and Natural Language Processing fields. Despite its importance, most current proposals are restricted to a single text type (e.g., essays, web comments) in a specific domain and fall short when applied to cross-domain data. This paper presents a new Argumentation Mining framework that detects argumentative segments and their components automatically using Convolutional Neural Networks (CNNs). We focus on both (1) argumentative sentence detection and (2) argument component detection. Based on different corpora, we investigate the performance of CNNs at both the in-domain level and the cross-domain level. The investigation shows results competitive with classic machine learning models.

Keywords: Argumentation Mining · Arguments · Convolutional Neural Network · Deep learning

1 Introduction

An important sub-field of computational argumentation is Argumentation Mining (AM), which aims to detect argumentative sentences and argument components from different sources (e.g., online debates, social media, persuasive essays, forums). In the last decade, the AM field has gained the interest of many researchers due to its impact in several domains [1,4,19]. The AM process can be divided into sub-tasks taking the form of a pipeline, as proposed in [3], with three sub-tasks: argumentative sentence detection, argument component boundary detection and argument structure prediction. Argumentative sentence detection is viewed as a classification problem where sentences are classified into two classes (i.e., argumentative, not argumentative). Argument component boundary detection is treated as a segmentation problem and may be cast either as a multi-class classification problem (i.e., classify each component) or as a set of binary classification problems (i.e., one classifier per component) solved using machine learning classifiers.
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 355–367, 2019.
https://doi.org/10.1007/978-3-030-35514-2_26
356 R. Bouslama et al.

Interesting applications in the AM area have been proposed: an argument search engine [19], and MARGOT, a web server for argumentation mining that offers an online platform for non-experts of the argumentation domain [4]. Considering the first two sub-tasks of AM (i.e., argument detection and argument component detection), an important study presented in [5] investigates different classifiers and different features in order to determine the best classifier and the most important features to take into account. Indeed, a classifier's performance depends on the model, the data and the features, and one of the most crucial yet most critical steps is the choice of the proper features. Recently, deep learning techniques have overcome this limitation and shown good results in text classification [8,9], which makes their exploration in the AM field interesting [6,12,14,24]. In addition, hand-crafted features make the generalization of models over many corpora harder. Nonetheless, most existing argument mining approaches are limited to one specific domain (e.g., student essays [1], online debates [2]); generalizing AM approaches over heterogeneous corpora is still poorly explored.
On the other hand, Convolutional Neural Networks have shown considerable success in computer vision [7], speech recognition [17] and computational linguistics [8,9,11], where they have proven competitive with traditional models without any need for knowledge of the syntactic or semantic structure of a given language [8,10]. Considering the success of CNNs in text classification, and more precisely in sentiment analysis [8], their use in AM is promising. Aker et al. [5] applied Kim's model [8] to both argumentative sentence detection and argument component boundary detection on two corpora, the persuasive essays corpus and a Wikipedia corpus; without any changes to the model, the results were significant.

Deep learning algorithms have attracted many researchers in the AM field. In [21], a joint RNN was used for argument component boundary detection, where the task was considered a sequence labelling problem, whereas in this work we treat claim detection and premise detection separately. In [24], the authors used CNNs and LSTMs (Long Short-Term Memory networks) to classify claims in online user comments. CNNs were also used for bi-sequence classification [12]. Recently, Hua et al. [6] proposed the use of CNNs for argument detection in peer reviews. Moreover, in [14] the authors proposed the use of deep learning networks, more specifically two variants of Kim's CNN and LSTM, to identify claims in a cross-domain manner. Kim's model [8] has been tested on many occasions in the AM field. However, to the best of our knowledge, only one work has studied the use of Zhang's character-based CNN model in AM [15]. That work presents models to classify argument components in classroom discussions in order to automatically classify students' utterances into claims, evidences and warrants. Their results showed that convolutional networks (whether character- or word-level) are more robust than recurrent networks.
Using CNN in Cross-Domain Argumentation Mining Framework 357

Moreover, most current argument mining approaches are designed for specific text types and fall short when applied to heterogeneous texts. Only a few proposals have treated the cross-domain case, such as the work of Ajjour et al. [22], where the authors systematically studied the major parameters of unit segmentation by exploring many features in a word-level setting at both the in-domain level and the cross-domain level. Recently, [23] proposed a new sentential annotation scheme that is reliably applicable by crowd workers to arbitrary Web texts.
Following recent advances in both the AM and deep learning fields, we propose an AM framework based on Convolutional Neural Networks, called ArguWeb, able to provide up-to-date arguments from the web in a cross-domain manner.

The rest of this paper is organized as follows: Sect. 2 presents the basic concepts of CNNs in text classification; Sect. 3 describes ArguWeb, a new framework for argument mining from the web; finally, Sect. 4 presents the conducted experimental study, discusses the results and gives an illustrative example of the use of the ArguWeb framework.

2 Basics on Convolutional Neural Network

Convolutional neural networks (CNNs) were originally developed for computer vision; they utilize layers with convolving filters [8]. Later on, CNNs were adapted to the Natural Language Processing (NLP) domain and showed remarkable results [8,9,11]. These algorithms have the advantage of not requiring syntactic or semantic knowledge. Kim [8] proposed a word-based version of CNNs that uses the word2vec technique.¹ Another milestone in the text classification field is the work of Zhang et al. [9], where the authors investigate the use of character-level convolutional networks instead of word-based representations (e.g., n-grams), with interesting results. Zhang et al. [9] succeeded in adapting CNNs from dealing with signals in image or speech recognition to dealing with characters, treating them as a kind of raw signal. Besides not requiring syntactic or semantic knowledge, character-level CNNs present other advantages:

– Good handling of misspelled and out-of-vocabulary words, as they can be easily learned.
– Ability to handle noisy data (especially texts extracted from forums and social media).
– No need for a text pre-processing phase (e.g., tokenization, stemming).
– No need to define features.

Figure 1 shows an overview of the convolutional neural network architecture. Although word-based and char-level CNNs have similar global architectures, they differ in some details, such as the input text representation and the number of convolutional and fully connected layers. Indeed, character-level CNNs are based on a temporal convolutional model that computes 1-D convolutions and temporal max pooling, a 1-D version of the max pooling used in computer vision [9]. The non-linearity is a thresholding function similar to Rectified Linear Units (ReLU), and training uses stochastic gradient descent (SGD) with minibatches of size 128. The input of the model is a sequence of characters encoded using one-hot encoding; these encoded characters form a sequence of vectors of fixed length l0. The model proposed in [9] consists of 9 layers: 6 convolutional layers and 3 fully connected layers. Two versions were presented: (i) a small version with an input length of 256 and (ii) a large version with an input length of 1024.

[Fig. 1. An overview of the Convolutional Neural Network: the input text representation feeds a stack of convolution layers, pooling, and fully connected layers that produce the output.]

As in Kim's model [8], word-based CNNs compute multi-dimensional convolutions (n × k matrices) and max-over-time pooling. For the representation of a sentence, two input channels are proposed (i.e., static and non-static). Kim's model is composed of a layer that performs convolutions over the embedded word vectors predefined in Word2Vec; max pooling is then applied to the result of the convolutional layer and, similarly to char-level models, Rectified Linear Units (ReLU) are applied.

¹ https://code.google.com/p/word2vec/
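To make the temporal (1-D) convolution and max-over-time pooling concrete, here is a minimal pure-Python sketch: characters are one-hot encoded over a small alphabet (a toy stand-in for the roughly 70-symbol alphabet of [9]), a single 1-D filter is slid over the sequence, and only the maximum response is kept.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # toy alphabet; [9] uses ~70 symbols

def one_hot(text, length):
    """Encode `text` as `length` one-hot vectors (unknown chars -> zero vector)."""
    rows = []
    for ch in text[:length].ljust(length):
        vec = [0.0] * len(ALPHABET)
        if ch in ALPHABET:
            vec[ALPHABET.index(ch)] = 1.0
        rows.append(vec)
    return rows

def conv_max_over_time(rows, kernel):
    """1-D convolution of a one-hot sequence with one filter, then temporal max pooling."""
    k = len(kernel)
    scores = []
    for t in range(len(rows) - k + 1):
        # dot product between the filter and the window starting at position t
        s = sum(kernel[j][c] * rows[t + j][c]
                for j in range(k) for c in range(len(ALPHABET)))
        scores.append(s)
    return max(scores)
```

A filter built as the one-hot encoding of "ab" responds maximally wherever the substring "ab" occurs, which is the intuition behind character n-gram detectors learned by the convolutional layers; the real model of course learns many such filters jointly.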

3 A New Framework for Argument Component Detection

In this section, we present ArguWeb, a new framework that ensures: (1) data gathering from different online forums and (2) argumentative text and component detection.

In ArguWeb, arguments are extracted directly from the web. The framework is mainly composed of two phases: a pre-processing phase, where arguments are extracted from the web and then segmented, and an argument component detection phase, where the components of arguments are detected (Fig. 2). The second phase relies on trained character-level and word-based Convolutional Neural Networks. An experimental study of the performance of character-level and word-based CNNs for Argumentation Mining was conducted for both sub-tasks, argumentative sentence detection and argument component detection. In this work, we consider both the in-domain and cross-domain situations. We also compare the performance of character-level CNNs to word-based CNNs, SVM and Naïve Bayes.

Argument component detecƟon phase

Pre-processing phase Segments

ArgumentaƟve
sentences
detecƟon
ArgumentaƟve
Segments
Argument
Web scrapping Claim components
detecƟon <claim,
Text premises>
ArgumentaƟve
Sentence Segments
segmentaƟon
Premises
detecƟon
Text segments

Fig. 2. An overview of ArguWeb architecture.

3.1 Pre-processing Phase

Using web scraping techniques, we extract user comments from several sources (e.g., social media, online forums). This step is ensured by a set of scrapers developed in Python. The extracted data then enters the core of our system, which is a set of trained models. More details about the argument component detection phase are given in the next sub-section.
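The scrapers themselves are straightforward. As a hedged illustration, using only the standard library's `html.parser` (rather than the BeautifulSoup setup the paper describes) and a made-up `comment` CSS class, comment extraction from an already-fetched page can look like this:

```python
from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    """Collect the text of every <div class="comment"> element.

    The class name is hypothetical; each forum needs its own selector.
    """
    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting level inside a comment div
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1 if tag == "div" else 0
        elif tag == "div" and ("class", "comment") in attrs:
            self.depth = 1
            self.comments.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.comments[-1] += data

def extract_comments(html):
    parser = CommentExtractor()
    parser.feed(html)
    return [c.strip() for c in parser.comments]
```

In practice the page would be fetched with the Requests library and parsed with BeautifulSoup, but the extraction logic is the same: locate the forum-specific comment containers and keep their text.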
Existing AM contributions treat different input granularities (i.e., paragraph, sentence, intra-sentence), and most existing work focuses on the sentence and intra-sentence levels [3]. In this work we focus on the sentence level and assume that a whole sentence coincides with an argument component (i.e., claim or premise). Therefore, after scraping user comments from the web, text segmentation is required: collected comments are segmented into sentences using the sentence tokenization of the Natural Language Toolkit [20].
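NLTK's `sent_tokenize` handles this segmentation in the framework. As a dependency-free sketch of the same step, a naive regex splitter looks like this (unlike NLTK's trained Punkt tokenizer, it will mis-handle abbreviations such as "e.g."):

```python
import re

def naive_sent_tokenize(text):
    """Split text on sentence-final punctuation followed by whitespace.

    A rough stand-in for nltk.sent_tokenize: no abbreviation handling.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

Each resulting sentence is then passed downstream as one candidate argument component.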

3.2 Argument Component Detection Phase

The argument component detection phase consists of: (1) separating argumentative texts from non-argumentative texts generated in the pre-processing phase and (2) detecting argument components (i.e., claims and premises) in argumentative segments. We follow the premise/claim model where:

Definition 1. An argument is a tuple:


Argument = <Premises, claims>

where the claims are the defended ideas and the premises provide explanations, proofs, facts, etc. that back up the claims.
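Under this claim/premise model, the pipeline's output can be represented by a minimal data structure; the sketch below is illustrative, not the framework's actual classes:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    """An argument as a tuple <premises, claims> (Definition 1)."""
    premises: List[str] = field(default_factory=list)
    claims: List[str] = field(default_factory=list)

    def is_backed(self) -> bool:
        # a claim is backed up when at least one premise supports it
        return bool(self.claims) and bool(self.premises)
```

Each detected argumentative comment yields one such tuple, filled by the claim and premise classifiers described next.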

Figure 2 depicts the steps of the argument component detection process: detecting argumentative sentences among non-argumentative ones, detecting claims in argumentative sentences, and then detecting the premises presented to back up the claims. For the three tasks, word-based and character-based CNNs are trained on three different corpora.
For the char-level CNN, an alphabet containing all letters, numbers and special characters is used to encode the input text; each input character is quantified using one-hot encoding. Stochastic Gradient Descent is used as the optimizer with mini-batches of size 32. For the word-based CNN, we do not use two input channels as proposed in the original paper; instead, we use only one channel. We also perform word embedding using an additional first layer that embeds words into low-dimensional vectors, and we use the ADAM algorithm as the optimizer.

4 Experimental Study

In this section we detail the conducted experiments and take a deeper look at ArguWeb's components. We also describe the corpora used to train the framework's models and discuss the main results.

4.1 Experimental Protocol

We aim to evaluate the performance of ArguWeb in terms of argument detection and argument component detection. To this end, we experiment with two classic machine learning classifiers, SVM and Naïve Bayes, as well as two deep learning classifiers, char-level CNN and word-level CNN.

Data. We perform our investigation on three different corpora:

– Persuasive Essays corpus [13]: consists of over 400 persuasive essays written by students. All essays have been segmented by three expert annotators into three types of argument units (i.e., major claim, claim and premise). In this paper we follow a claim/premise argumentation model, so to ensure comparability between data sets, each major claim is considered a claim. This corpus presents domain-specific data (i.e., argumentative essays). The first task evoked in this paper is argumentative sentence/text detection; for this purpose, we added non-argumentative data to the argumentative essays corpus: descriptive and narrative texts extracted from Academic Help² and descriptive short stories from The Short Story website³.

² https://academichelp.net/
³ https://theshortstory.co.uk

– Web Discourse corpus [16]: contains 990 comments and forum posts labeled as persuasive or non-persuasive, and 340 documents annotated with the extended Toulmin model into claims, grounds, backing, rebuttals and refutations. For this data set, we consider grounds and backing as premises, since they back up the claim, while rebuttals and refutations are ignored. This corpus presents domain-free data.
– N-domain corpus: we construct a third corpus by combining the persuasive essays and web-discourse corpora, making the data even more heterogeneous, with the goal of investigating the performance of the different models in a multi-domain context.

The data description in terms of class distributions is given in Table 1, where the possible classes are: Argumentative (Arg), Not-argumentative (Not-arg), Claim, Not-claim, Premise and Not-premise. Each row gives the number of instances of each class in each corpus.

Table 1. Class distribution in each corpus

Corpus          Arg  Not-arg  Claim  Not-claim  Premise  Not-premise
Essays          402  228      2250   748        3706     2019
Web-discourse   526  461      275    526        575      543
N-domain        928  689      2525   1274       4281     2562

In order to train SVM and Naïve Bayes, a data pre-processing phase and a set of features (e.g., semantic, syntactic) are required. For this purpose, we tokenize all corpora into words and apply both lemmatization and stemming; a word such as “studies” thus gives the stem “studi” and the lemma “study”, which hints at the meaning and role of the word. Before lemmatization and stemming, we use the POS-tag (Part-Of-Speech tag) technique to indicate the grammatical category of each word in a sentence (i.e., noun, verb, adjective, adverb). We also use the well-known TF-IDF technique [25], which outperforms bag-of-words techniques: Term Frequency (TF) counts the occurrences of a given word in a sentence, and Inverse Document Frequency (IDF) measures a word's rarity in the vocabulary of each corpus.
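TF-IDF itself is simple to state. Below is a dependency-free sketch over pre-tokenized documents; note that smoothing and normalization conventions vary (scikit-learn's TfidfVectorizer, for instance, smooths the IDF and L2-normalizes rows), so this is an illustration of the idea rather than the exact features used:

```python
import math

def tf_idf(docs):
    """Return one {word: tf*idf} dict per tokenized document.

    tf = count of the word in the document; idf = log(N / df),
    where df is the number of documents containing the word.
    """
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in docs:
        tf = {}
        for word in doc:
            tf[word] = tf.get(word, 0) + 1
        scores.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return scores
```

Words occurring in every document get weight 0, so frequent but uninformative tokens are automatically discounted.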
For the word-based CNN, we only pad each sentence to the maximum sentence length, in order to batch data efficiently, and build a vocabulary index used to encode each sentence as a vector of integers. For the character-level CNN, we only remove URLs and hashtags from the original data.

Experiment Process Description. For each corpus (i.e., Essays, Web-Discourse, n-domain), three models of each classifier are constructed: one to detect argumentative segments, one to detect claims and one to detect premises. For the cross-domain case, six models of each classifier are trained for each sub-task (i.e., argument detection, claim detection and premise detection), where each model is trained on one corpus and tested on another. This guarantees the cross-domain setting.
To train the char-level CNN models, we start with a learning rate of 0.01, halved every 3 epochs. For all corpora, the number of epochs is fixed to 10, yet training may be stopped early if the validation loss does not improve for 3 consecutive epochs. The loss is computed using the cross-entropy loss function. Two versions of the char-level CNN exist: the small version (input length 256) is used for in-domain training, and the large version (input length 1024) for cross-domain training. For the word-based CNN models, we use the same loss function (i.e., cross-entropy) and optimize it with the ADAM optimizer. We use a 128-dimensional embedding, filter sizes of 3, 4 and 5, and a mini-batch size of 128. In addition, L2 regularization (set to 5) and a dropout rate of 0.5 are applied to avoid overfitting. The classification result is produced by a softmax layer. In this paper, we do not use pre-trained Word2Vec word vectors; instead, we learn word embeddings from scratch.
To train the char-level CNN, SVM and Naïve Bayes models, we split each data set into 80% for training and 20% for validation, while for the word-based CNN models we split the data into 80% and 10%, following the original word-based CNN paper [8].
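The learning-rate halving and early-stopping policy just described can be sketched independently of TensorFlow; here the per-epoch validation losses are placeholders standing in for a real training step:

```python
def run_training(val_losses, lr0=0.01, halve_every=3, patience=3, max_epochs=10):
    """Replay the schedule: lr halved every `halve_every` epochs,
    early stop after `patience` epochs without validation improvement.

    `val_losses` stands in for the losses a real training loop would
    produce.  Returns (epochs_run, final_lr).
    """
    lr = lr0
    best = float("inf")
    since_improved = 0
    epochs = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        epochs += 1
        if loss < best:
            best, since_improved = loss, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break          # validation loss stalled for `patience` epochs
        if (epoch + 1) % halve_every == 0:
            lr /= 2.0          # halve the learning rate every 3 epochs
    return epochs, lr
```

With steadily decreasing losses, training runs the full 10 epochs with three halvings; a plateau triggers the early stop.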

Implementation Environment. The whole ArguWeb framework is developed in Python. The web scrapers extract user comments from forum websites such as Quora⁴ using the BeautifulSoup and Requests libraries. The extracted comments are segmented using the Natural Language Toolkit [20]. To develop and train the SVM and Naïve Bayes models, we use the NLTK, Sklearn and collections packages. Both the char-level and word-based CNNs are implemented using the TensorFlow library.

Evaluation Metrics. To evaluate the models' performance, we use the most widely used and recommended metric in the state of the art, the macro F1-score, for both argumentative sentence detection and argument component detection, since it treats all classes equally (e.g., argumentative and non-argumentative) and is the most suitable in cases of imbalanced class distribution.

The macro F1-score is the harmonic mean of the macro-average precision and the macro-average recall:

\[ F1\text{-}score = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision} \tag{1} \]

where, averaging over the n classes,

\[ Recall = \frac{1}{n}\sum_{i=1}^{n} \frac{TruePositive_i}{TruePositive_i + FalseNegative_i} \tag{2} \]
⁴ https://www.quora.com/

and

\[ Precision = \frac{1}{n}\sum_{i=1}^{n} \frac{TruePositive_i}{TruePositive_i + FalsePositive_i} \tag{3} \]

with n the number of classes.


Precision refers to the percentage of returned results that are relevant, and recall refers to the percentage of all relevant results that are correctly classified, which makes the F1-score an effective metric for model evaluation even in cases of uneven class distribution.
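These formulas translate directly into code. The dependency-free sketch below computes the macro F1-score from gold and predicted label lists; note it follows Eqs. (1)–(3), i.e., the harmonic mean of macro-averaged precision and recall, whereas some libraries instead average the per-class F1-scores:

```python
def macro_f1(gold, pred):
    """Macro F1: per-class precision/recall, macro-averaged, then harmonic mean.

    Empty denominators (a class never predicted or never present) count as 0.
    """
    classes = sorted(set(gold) | set(pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(classes)
    recall = sum(recalls) / len(classes)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because each class contributes equally to the averages regardless of its frequency, a model that ignores a minority class is penalized, which is exactly why the metric suits the imbalanced distributions of Table 1.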

4.2 Experimental Results

We evaluate the character-level CNN, word-based CNN, SVM and Naïve Bayes in two situations, in-domain and cross-domain, based on the macro F1-score. Table 2 reports the macro F1-scores obtained by models trained and tested on the same corpus (in-domain setting) or trained on one corpus and tested on another (cross-domain setting). In Table 2, AD, CD and PD stand for Argument Detection, Claim Detection and Premise Detection, respectively. The in-domain results are those whose training and test corpora coincide; the other entries are cross-domain scores, where a model is trained on the corpus indicated by the column and tested on the corpus of the column group. Each row presents the results of one of the models considered in this paper, and the highest value in each column indicates the best-performing model.
In the first task (i.e., argument detection), the word-based CNN outperforms all other models on the Essays corpus, while SVM and the char-level CNN present close results on the Web-discourse corpus, with macro F1-scores of 0.45 and 0.52, respectively. On the n-domain corpus, SVM outperforms the other models. In the cross-domain setting, the word-based and char-level CNNs present better results than SVM and Naïve Bayes, except when models are tested on the n-domain corpus. This may be explained by the fact that, in this case, the test data is close to the training data, since the test set contains instances from both the Essays and Web-discourse corpora, as explained before. The char-level and word-based CNNs present better results than SVM and Naïve Bayes in the second task (i.e., claim detection). As for the last task (i.e., premise detection), the char-level CNN outperforms the rest of the models remarkably.

The Web-discourse corpus contains domain-free data extracted from the web, with many misspelled words, internet acronyms, out-of-vocabulary words, etc. This explains why the char-level CNN outperforms the other models in many cases, in both the in-domain and cross-domain situations, and shows that character-level CNNs perform well even when trained on noisy data.

Table 2. The in-domain and cross-domain macro F1-scores. Each row gives the results of one model (character-level CNN, word-based CNN, SVM and Naïve Bayes); within each "Test on" block, the columns give the training corpus.

                        | Test on Essays      | Test on Web-disc.   | Test on n-domain
                        | trained on:         | trained on:         | trained on:
Task  Model             | Ess.  Web   n-dom   | Ess.  Web   n-dom   | Ess.  Web   n-dom
AD    Char-level CNN    | 0.72  0.93  0.56    | 0.56  0.52  0.81    | 0.51  0.46  0.62
      Word-based CNN    | 0.74  0.35  0.33    | 0.33  0.39  0.83    | 0.33  0.36  0.44
      SVM               | 0.66  0.30  0.88    | 0.37  0.54  0.61    | 0.68  0.29  0.83
      Naïve Bayes       | 0.11  0.63  0.57    | 0.25  0.34  0.37    | 0.36  0.63  0.75
CD    Char-level CNN    | 0.83  0.59  0.57    | 0.28  0.54  0.88    | 0.45  0.49  0.82
      Word-based CNN    | 0.98  0.34  0.69    | 0.59  0.45  0.51    | 0.61  0.34  0.44
      SVM               | 0.61  0.55  0.37    | 0.48  0.43  0.68    | 0.47  0.55  0.89
      Naïve Bayes       | 0.50  0.10  0.12    | 0.26  0.49  0.30    | 0.24  0.11  0.59
PD    Char-level CNN    | 0.80  0.98  0.80    | 0.51  0.78  0.75    | 0.59  0.91  0.82
      Word-based CNN    | 0.43  0.35  0.44    | 0.37  0.60  0.33    | 0.33  0.85  0.78
      SVM               | 0.44  0.06  0.12    | 0.43  0.71  0.89    | 0.35  0.84  0.69
      Naïve Bayes       | 0.39  0.10  0.16    | 0.34  0.57  0.87    | 0.35  0.80  0.62

ArguWeb presented coherent results for the argumentative sentence detection and argument component detection tasks compared with state-of-the-art results [1,5,18]. The conducted experiments showed that the character-level CNN outperforms the word-based CNN, SVM and Naïve Bayes on noisy, web-extracted data. Both character-level and word-based CNNs presented interesting results compared to SVM and Naïve Bayes, without any need for feature selection.

4.3 Illustrative Example


In what follows, we illustrate the role of ArguWeb. In this example, we focus on the character-level CNN, since it showed interesting results when dealing with noisy, misspelled and out-of-vocabulary data, and since we are dealing with data extracted from the web. As mentioned before, the framework contains web scrapers responsible for extracting data from different websites; here, we focus on online forums such as Quora and Reddit and extract user comments on different subjects.
Each text extracted from the web is classified by the nine trained character-level CNNs (one character-level CNN is trained on each corpus, persuasive essays, web-discourse and n-domain, and six others are trained on one corpus and tested on another). A text is considered argumentative if at least six CNN models classify it as argumentative. Similarly, a segment is classified as a claim (or premise) if at least six CNN models label it as a claim (resp. premise).

[Fig. 3. Excerpt of arguments scraped from online forums.]
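This six-of-nine decision rule is a plain majority-style ensemble. A minimal sketch, with the model outputs stubbed as booleans:

```python
def ensemble_label(votes, threshold=6):
    """True when at least `threshold` of the models voted positive.

    `votes` holds one boolean per trained CNN (nine in ArguWeb).
    """
    return sum(bool(v) for v in votes) >= threshold

def classify_segment(claim_votes, premise_votes):
    """Label a segment 'claim', 'premise' or 'none' using the 6-of-9 rule."""
    if ensemble_label(claim_votes):
        return "claim"
    if ensemble_label(premise_votes):
        return "premise"
    return "none"
```

In principle a segment could reach six votes as both a claim and a premise; the sketch arbitrarily prioritizes the claim label, one of several reasonable tie-breaking choices.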
Figure 3a and b contain examples of comments extracted from Quora, and Fig. 3c depicts a comment extracted from the Reddit forum platform. These comments were detected as arguments. Figure 3a contains a comment extracted from an argument between several Quora users, Fig. 3b contains a comment of a user convincing another one about Japanese cars, and Fig. 3c contains a comment of a user arguing about why iPhone and Samsung users hate on each other. Once arguments are detected, ArguWeb classifies each comment's components as claims, premises, or neither. Figure 3 details the detected components (i.e., claims and premises) of these arguments. Uncoloured text segments were classified neither as claims nor as premises.
Comments like "As announced by YouTube Music! Congrats, Taylor!!!" were classified from the start as non-argumentative and were not processed by the models responsible for detecting the different components.
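The nine-model majority vote described in this example can be sketched in a few lines; the function name and the 0/1 label encoding are our own illustration, not part of the ArguWeb implementation.

```python
# Majority vote over the nine trained CNNs: a text (or segment) keeps a
# label only if at least six of the models agree. Labels are encoded as 0/1.

def majority_vote(predictions, threshold=6):
    """predictions: one 0/1 label per trained CNN (nine in the paper)."""
    return sum(predictions) >= threshold
```

The same rule is applied once per label: first for "argumentative" on the whole text, then for "claim" and "premise" on each segment.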

5 Conclusion
This paper proposes ArguWeb, a cross-domain framework for argument detection on the web. The framework is based on a set of web scrapers that extract users' comments from the web (e.g., social media, online forums). Extracted data is classified as (1) argumentative or not, and (2) claims, premises, or neither, using character-level and word-based convolutional neural networks. An experimental study was conducted in which both versions of the CNN were compared to classic machine learning classifiers (i.e., SVM and Naïve Bayes). The study showed that both versions of the CNN obtained results competitive with classic machine learning techniques in both tasks. The framework is intended to be used to extract arguments from different platforms on the web following the claim/premise model.
366 R. Bouslama et al.
Future work will integrate the ArguWeb framework into an automated Argumentation-Based Negotiation system, where the integration of up-to-date arguments seems promising. We will also handle argument component detection at the intra-sentence level rather than only at the sentence level. Moreover, a semantic analysis of these components will be integrated in order to classify them into explanations, counter-examples, etc.

References
1. Stab, C., Gurevych, I.: Identifying argumentative discourse structures in persuasive
essays. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 46–56 (2014)
2. Cabrio, E., Villata, S.: Combining textual entailment and argumentation theory
for supporting online debates interactions. In: Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics (vol. 2: Short Papers),
pp. 208–212 (2012)
3. Lippi, M., Torroni, P.: Argumentation mining: state of the art and emerging trends.
ACM Trans. Internet Technol. (TOIT) 16(2), 10 (2016)
4. Lippi, M., Torroni, P.: MARGOT: a web server for argumentation mining. Expert
Syst. Appl. 65, 292–303 (2016)
5. Aker, A., et al.: What works and what does not: classifier and feature analysis for
argument mining. In: Proceedings of the 4th Workshop on Argument Mining, pp.
91–96 (2017)
6. Hua, X., Nikolov, M., Badugu, N., Wang, L.: Argument mining for understanding
peer reviews. arXiv preprint. arXiv:1903.10104 (2019)
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information
Processing Systems, pp. 1097–1105 (2012)
8. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint.
arXiv:1408.5882 (2014)
9. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text
classification. In: Proceedings of Advances in Neural Information Processing Sys-
tems, pp. 649–657 (2015)
10. Dos Santos, C., Gatti, M.: Deep convolutional neural networks for sentiment anal-
ysis of short texts. In: Proceedings of COLING 2014, the 25th International Con-
ference on Computational Linguistics: Technical Papers, pp. 69–78 (2014)
11. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language
models. In: Proceedings of 30th AAAI Conference on Artificial Intelligence (2016)
12. Laha, A., Raykar, V.: An empirical evaluation of various deep learning architectures
for bi-sequence classification tasks. arXiv preprint. arXiv:1607.04853 (2016)
13. Stab, C., Gurevych, I.: Parsing argumentation structures in persuasive essays.
Comput. Linguist. 43(3), 619–659 (2017)

14. Daxenberger, J., Eger, S., Habernal, I., Stab, C., Gurevych, I.: What is the essence
of a claim? Cross-domain claim identification. arXiv preprint. arXiv:1704.07203
(2017)
15. Lugini, L., Litman, D.: Argument component classification for classroom discus-
sions. In: Proceedings of the 5th Workshop on Argument Mining, pp. 57–67 (2018)
16. Habernal, I., Eckle-Kohler, J., Gurevych, I.: Argumentation mining on the web
from information seeking perspective. In: Proceedings of ArgNLP (2014)
17. Abdel-Hamid, O., Mohamed, A.R., Jiang, H., Penn, G.: Applying convolutional
neural networks concepts to hybrid NN-HMM model for speech recognition. In:
Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 4277–4280 (2012)
18. Al-Khatib, K., Wachsmuth, H., Hagen, M., Köhler, J., Stein, B.: Cross-domain
mining of argumentative text through distant supervision. In: Proceedings of the
2016 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, pp. 1395–1404 (2016)
19. Wachsmuth, H., et al.: Building an argument search engine for the web. In: Pro-
ceedings of the 4th Workshop on Argument Mining, pp. 49–59 (2017)
20. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing
Text with the Natural Language Toolkit. O’Reilly Media Inc., Newton (2009)
21. Li, M., Gao, Y., Wen, H., Du, Y., Liu, H., Wang, H.: Joint RNN model for argument
component boundary detection. In: Proceedings of the 2017 IEEE International
Conference on Systems, Man, and Cybernetics (SMC), pp. 57–62 (2017)
22. Ajjour, Y., Chen, W.F., Kiesel, J., Wachsmuth, H., Stein, B.: Unit segmentation of
argumentative texts. In: Proceedings of the 4th Workshop on Argument Mining,
pp. 118–128 (2017)
23. Stab, C., Miller, T., Gurevych, I.: Cross-topic argument mining from heterogeneous
sources using attention-based neural networks. arXiv preprint. arXiv:1802.05758
(2018)
24. Guggilla, C., Miller, T., Gurevych, I.: CNN- and LSTM-based claim classification
in online user comments. In: Proceedings of COLING 2016, the 26th International
Conference on Computational Linguistics: Technical Papers, pp. 2740–2751 (2016)
25. Sparck Jones, K.: A statistical interpretation of term specificity and its application
in retrieval. J. Doc. 28(1), 11–21 (1972)
ConvNet and Dempster-Shafer Theory
for Object Recognition

Zheng Tong, Philippe Xu, and Thierry Denœux(B)

Université de Technologie de Compiègne, CNRS, UMR 7253 Heudiasyc,


Compiègne, France
{zheng.tong,philippe.xu}@hds.utc.fr
[email protected]

Abstract. We propose a novel classifier based on convolutional neu-


ral network (ConvNet) and Dempster-Shafer theory for object recogni-
tion allowing for ambiguous pattern rejection, called the ConvNet-BF
classifier. In this classifier, a ConvNet with nonlinear convolutional lay-
ers and a global pooling layer extracts high-dimensional features from
input data. The features are then imported into a belief function clas-
sifier, in which they are converted into mass functions and aggregated
by Dempster’s rule. Evidence-theoretic rules are finally used for pattern
classification and rejection based on the aggregated mass functions. We
propose an end-to-end learning strategy for adjusting the parameters in
the ConvNet and the belief function classifier simultaneously and deter-
mining the rejection loss for evidence-theoretic rules. Experiments with
the CIFAR-10, CIFAR-100, and MNIST datasets show that hybridizing
belief function classifiers with ConvNets makes it possible to reduce error
rates by rejecting patterns that would otherwise be misclassified.

Keywords: Pattern recognition · Belief function · Convolutional


neural network · Supervised learning · Evidence theory

1 Introduction
Dempster-Shafer (DS) theory of belief functions [3,24] has been widely used for
reasoning and making decisions with uncertainty [29]. DS theory is based on
representing independent pieces of evidence by completely monotone capacities
and aggregating them using Dempster’s rule. In the past decades, DS theory has
been applied to pattern recognition and supervised classification in three main
directions. The first one is classifier fusion, in which classifier outputs are con-
verted into mass functions and fused by Dempster’s rule (e.g., [2,19]). Another
direction is evidential calibration: the decisions of classifiers are transformed into
This research was carried out in the framework of the Labex MS2T, which was funded
by the French Government, through the program “Investments for the future” managed
by the National Agency for Research (Reference ANR-11-IDEX-0004-02). It was also
supported by a scholarship from the China Scholarship Council.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 368–381, 2019.
https://doi.org/10.1007/978-3-030-35514-2_27

mass functions (e.g., [20,28]). The last approach is to design evidential classifiers
(e.g., [6]), which represent the evidence of each feature as elementary mass func-
tions and combine them by Dempster’s rule. The combined mass functions are
then used for decision making [5]. Compared with conventional classifiers, evi-
dential classifiers can provide more informative outputs, which can be exploited
for uncertainty quantification and novelty detection. Several principles have been
proposed to design evidential classifiers, mainly including the evidential k-nearest
neighbor rule [4,9], and evidential neural network classifiers [6]. In practice, the
performance of evidential classifiers heavily depends on two factors: the training
set size and the reliability of object representation. With the development of the
“Big Data” age, the number of examples in benchmark datasets for supervised
algorithms has increased from 10^2 to 10^5 [14] and even 10^9 [21]. However, little
has been done to combine recent techniques for object representation with DS
theory.
Thanks to the explosive development of deep learning [15] and its applications
[14,25], several approaches for object representation have been developed, such
as restricted Boltzmann machines [1], deep autoencoders [26,27], deep belief net-
works [22,23], and convolutional neural networks (ConvNets) [12,17]. ConvNet,
which is maybe the most promising model and the main focus of this paper,
mainly consists of convolutional layers, pooling layers, and fully connected lay-
ers. It has been proved that ConvNets have the ability to extract local features
and compute global features, such as from edges to corners and contours to
object parts. In general, robustness and automation are two desirable properties
of ConvNets for object representation. Robustness means strong tolerance to
translation and distortion in deep representation, while automation implies that
object representation is data-driven with no human assistance.
Motivated by recent advances in DS theory and deep learning, we propose to
combine ConvNet and DS theory for object recognition allowing for ambiguous
pattern rejection. In this approach, a ConvNet with nonlinear convolutional
layers and a global pooling layer is used to extract high-order features from
input data. Then, the features are imported into a belief function classifier, in
which they are converted into Dempster-Shafer mass functions and aggregated by
Dempster’s rule. Finally, evidence-theoretic rules are used for pattern recognition
and rejection based on the aggregated mass functions. The performances of this
classifier on the CIFAR-10, CIFAR-100, and MNIST datasets are demonstrated
and discussed.
The organization of the rest of this paper is as follows. Background knowledge
on DS theory and ConvNet is recalled in Sect. 2. The new combination between
DS theory and ConvNet is then established in Sect. 3, and numerical experiments
are reported in Sect. 4. Finally, we conclude the paper in Sect. 5.

2 Background
In this section, we first recall some necessary definitions regarding the DS theory
and belief function classifier (Sect. 2.1). We then provide a description of the

architecture of a ConvNet that will be combined with a belief function classifier


later in the paper (Sect. 2.2).

2.1 Dempster-Shafer Theory

Evidence Theory. The main concepts regarding DS theory are briefly presented
in this section, and some basic notations are introduced. Detailed information can
be found in Shafer’s original work [24] and some up-to-date studies [8].
Given a finite set Ω = {ω1, · · · , ωk}, called the frame of discernment, a mass function is a function m from 2^Ω to [0, 1] verifying m(∅) = 0 and

∑_{A⊆Ω} m(A) = 1.    (1)

For any A ⊆ Ω, given a certain piece of evidence, m(A) can be regarded as the
belief that one is willing to commit to A. Set A is called a focal element of m
when m(A) > 0.
For all A ⊆ Ω, a credibility function bel and a plausibility function pl, associated with m, are defined as

bel(A) = ∑_{B⊆A} m(B)    (2)

pl(A) = ∑_{A∩B≠∅} m(B).    (3)

The quantity bel(A) is interpreted as a global measure of one’s belief that


hypothesis A is true, while pl(A) is the amount of belief that could potentially
be placed in A.
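As a quick sanity check on these definitions, bel and pl can be computed directly from a mass function stored as a dictionary; the toy frame and masses below are our own illustration, not the paper's.

```python
# Sketch of Eqs. (2)-(3): bel sums the masses of subsets of A, pl sums the
# masses of focal elements intersecting A. Focal elements are frozensets.

def bel(m, A):
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    return sum(v for B, v in m.items() if B & A)

m = {frozenset({'a'}): 0.5, frozenset({'a', 'b'}): 0.3, frozenset({'c'}): 0.2}
A = frozenset({'a'})
# bel(m, A) = 0.5 (only {a} is a subset of A);
# pl(m, A) = 0.8 ({a} and {a, b} both intersect A).
```

As expected, bel(A) ≤ pl(A) for every A.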
Two mass functions m1 and m2 representing independent items of evidence can be combined by Dempster's rule ⊕ [3,24] as

(m1 ⊕ m2)(A) = ( ∑_{B∩C=A} m1(B) m2(C) ) / ( ∑_{B∩C≠∅} m1(B) m2(C) )    (4)

for all A ≠ ∅, and (m1 ⊕ m2)(∅) = 0. Mass functions m1 and m2 can be combined if and only if the denominator on the right-hand side of (4) is strictly positive. The operator ⊕ is commutative and associative.
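Dempster's rule (4) translates almost literally into code; this is a sketch under the same dictionary representation as above (focal elements as frozensets), not a production implementation.

```python
from itertools import product

# Direct transcription of Dempster's rule (4): mass functions are
# dictionaries mapping focal elements (frozensets) to masses.

def dempster(m1, m2):
    combined, conflict = {}, 0.0
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + v1 * v2
        else:
            conflict += v1 * v2      # mass sent to the empty intersection
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence: rule undefined")
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

# Example: two items of evidence over the frame {a, b}.
m1 = {frozenset({'a'}): 0.6, frozenset({'a', 'b'}): 0.4}
m2 = {frozenset({'b'}): 0.5, frozenset({'a', 'b'}): 0.5}
out = dempster(m1, m2)   # conflict 0.3 is renormalized away
```

The strictly positive denominator in (4) corresponds to the `conflict < 1` check.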

Belief Function Classifier. Based on DS theory, an adaptive pattern classifier,


called belief function classifier, was proposed by Denœux [6]. The classifier uses
reference patterns as items of evidence regarding the class membership. The
evidence is represented by mass functions and combined using Dempster’s rule.

(a) Architecture of a belief function classifier (b) Connection between layers L2 and L3

Fig. 1. Belief function classifier

In this section, we describe the architecture of a belief function classifier. For


a more complete introduction, readers are invited to refer to Denœux’s original
work [6].
We denote by x ∈ R^P a pattern to be classified into one of M classes ω1, · · · , ωM, and by X a training set of N P-dimensional patterns. A belief function classifier quantifies the uncertainty about the class of x by a belief function on Ω = {ω1, · · · , ωM}, using a three-step procedure. This procedure can also be implemented in the multi-layer neural network illustrated in Fig. 1. It is based on n prototypes p^1, · · · , p^n, which are the weight vectors of the units in the first hidden layer L1. The three steps are defined as follows.

Step 1: The distance between x and each prototype p^i is computed as

d^i = ‖x − p^i‖,   i = 1, · · · , n,    (5)

and the activation of the corresponding neuron is defined by introducing new parameters η^i (η^i ∈ R) as s^i = α^i exp(−(η^i)^2 (d^i)^2), where α^i ∈ (0, 1) is a parameter associated with the prototype p^i.

Step 2: The mass function m^i associated with prototype p^i is computed as

m^i = (m^i({ω1}), . . . , m^i({ωM}), m^i(Ω))^T    (6a)
    = (u^i_1 s^i, . . . , u^i_M s^i, 1 − s^i)^T,    (6b)

where u^i = (u^i_1, . . . , u^i_M) is a vector of parameters associated with the prototype p^i verifying ∑_{j=1}^{M} u^i_j = 1.
As illustrated in Fig. 1a, Eq. (6) can be regarded as computing the activations
of units in the second hidden layer L2 , composed of n modules of M + 1 units
each. The units of module i are connected to neuron i of the previous layer. The
output of module i in the hidden layer corresponds to the belief masses assigned
by m^i.

Step 3: The n mass functions m^i, i = 1, · · · , n, are combined in the final layer based on Dempster's rule, as shown in Fig. 1b. The vectors of activations μ^i = (μ^i_1, · · · , μ^i_{M+1}), i = 1, . . . , n, of the final layer L3 are defined by the following equations:

μ^1 = m^1,    (7a)

μ^i_j = μ^{i−1}_j m^i({ωj}) + μ^{i−1}_j m^i(Ω) + μ^{i−1}_{M+1} m^i({ωj})    (7b)

for i = 2, · · · , n and j = 1, · · · , M, and

μ^i_{M+1} = μ^{i−1}_{M+1} m^i(Ω),   i = 2, · · · , n.    (7c)

The classifier output m = (m({ω1}), . . . , m({ωM}), m(Ω))^T is finally obtained as m = μ^n.
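The three steps can be sketched end to end with NumPy. The combination loop transcribes Eqs. (7a)-(7c): it is unnormalized, so the output masses may sum to less than one (the conflict mass is discarded). All shapes and parameter values used below are illustrative assumptions.

```python
import numpy as np

# Forward pass of the belief function classifier (Eqs. 5-7).

def bf_classifier(x, prototypes, alpha, eta, u):
    """x: (P,); prototypes: (n, P); alpha, eta: (n,);
    u: (n, M), each row summing to 1.
    Returns (m({w_1}), ..., m({w_M}), m(Omega))."""
    n, M = u.shape
    d = np.linalg.norm(x - prototypes, axis=1)     # Eq. (5)
    s = alpha * np.exp(-(eta ** 2) * d ** 2)       # prototype activations
    mu = np.append(u[0] * s[0], 1.0 - s[0])        # Eq. (7a), using Eq. (6)
    for i in range(1, n):
        m_i = np.append(u[i] * s[i], 1.0 - s[i])   # Eq. (6) for prototype i
        new = mu[:M] * m_i[:M] + mu[:M] * m_i[M] + mu[M] * m_i[:M]  # Eq. (7b)
        mu = np.append(new, mu[M] * m_i[M])        # Eq. (7c)
    return mu
```

On a toy two-class problem with two prototypes, the output is a length-3 mass vector whose components are non-negative and sum to at most one.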

Evidence-Theoretic Rejection Rules. Different strategies to make a decision (e.g., assignment to a class or rejection) based on the possible consequences of each action were proposed in [5]. For a complete training set X, we consider actions α_i, i ∈ {1, · · · , M}, assigning the pattern to each class, and a rejection action α_0. Assuming the cost of correct classification to be 0, the cost of misclassification to be 1, and the cost of rejection to be λ_0, the three conditions for rejection reviewed in [5] can be expressed as:

Maximum credibility: max_{j=1,··· ,M} m({ωj}) < 1 − λ_0
Maximum plausibility: max_{j=1,··· ,M} m({ωj}) + m(Ω) < 1 − λ_0
Maximum pignistic probability: max_{j=1,··· ,M} m({ωj}) + m(Ω)/M < 1 − λ_0.

Otherwise, the pattern is assigned to class ωj with j = arg max_{k=1,··· ,M} m({ωk}). For the maximum plausibility and maximum pignistic probability rules, rejection is possible if and only if 0 ≤ λ_0 ≤ 1 − 1/M, whereas a rejection action for the maximum credibility rule only requires 0 ≤ λ_0 ≤ 1.
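The three rejection conditions differ only in the bonus added to m({ωj}) before thresholding; here is a compact sketch (function and rule names are ours, not the paper's).

```python
# `m` is the combined mass vector (m({w_1}), ..., m({w_M}), m(Omega)).

def decide(m, rule="credibility", reject_cost=0.2):
    """Return the chosen class index, or -1 for rejection."""
    M = len(m) - 1
    omega = m[-1]
    bonus = {"credibility": 0.0,      # bare singleton mass
             "plausibility": omega,   # add all of m(Omega)
             "pignistic": omega / M}[rule]
    scores = [mj + bonus for mj in m[:M]]
    best = max(range(M), key=lambda j: scores[j])
    # Reject when even the best-supported class stays below 1 - lambda_0.
    return -1 if scores[best] < 1.0 - reject_cost else best
```

With m = (0.7, 0.1, 0.2) and λ0 = 0.2, the credibility rule rejects (0.7 < 0.8) while the plausibility rule accepts class 1 (0.7 + 0.2 ≥ 0.8), which illustrates how the plausibility rule is the least conservative of the three.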

2.2 Convolutional Neural Network

In this section, we provide a brief description of some state-of-the-art techniques


for ConvNets including the nonlinear convolutional operation and global average
pooling (GAP), which will be implemented in our new model in Sect. 3. Detailed
information about the two structure layers can be found in [18].

Nonlinear Convolutional Operation. The convolutional layer [15] is highly


efficient for feature extraction and representation. In order to approximate the
representations of the latent concepts related to the class membership, a novel
convolutional layer has been proposed [18], in which nonlinear multilayer per-
ceptron (MLP) operations replace classic convolutional operations to convolve
over the input. An MLP layer with nonlinear convolutional operations can be
summarized as follows:

f^1_{i,j,k} = ReLU((w^1_k)^T · x + b^1_k),   k = 1, · · · , C    (8a)
. . .
f^m_{i,j,k} = ReLU((w^m_k)^T · f^{m−1}_{i,j} + b^m_k),   k = 1, · · · , C.    (8b)

Here, m is the number of layers in the MLP. Matrix x, called the receptive field, of size i × j × o, is a patch of the input data of size (rW − r − p + i) × (rH − r − p + j) × o. An MLP layer with stride r and padding p generates a W × H × C tensor, called feature maps. The size of a feature map is W × H × 1, while the channel number of the feature maps is C. A rectified linear unit (ReLU) is used as activation function: ReLU(x) = max(0, x). As shown in Eq. (8), element-by-element multiplications are first performed between x and the transpositions of the weight matrices w^1_k (k = 1, · · · , C) in the first layer of the MLP. Each weight matrix w^1_k has the same size as the receptive field. Then the multiplied values are summed, and the bias b^1_k (k = 1, · · · , C) is added to the summed values. The results are transformed by a ReLU function. The output vector is f^1_{i,j} = (f^1_{i,j,1}, f^1_{i,j,2}, · · · , f^1_{i,j,C}). The outputs then flow into the remaining layers in sequence, generating f^m_{i,j} of size 1 × 1 × C. After processing all patches by the MLP, the input data is transformed into a W × H × C tensor. As the channel number C of the last MLP in a ConvNet is the same as the input data dimension P in a belief function classifier, a W × H × P tensor is finally generated by the ConvNet.
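To make Eq. (8) concrete, here is a tiny NumPy sketch of an MLP convolution sliding over a single-channel input; the two-layer MLP, unit stride, and absence of padding are illustrative assumptions of ours.

```python
import numpy as np

# Toy sketch of Eq. (8): an MLP "convolution" applies a small network to
# every receptive field of the input.

def relu(z):
    return np.maximum(0.0, z)

def mlpconv(x, weights, biases):
    """x: (H, W) input; weights[0]: (C, r, r) first-layer filters;
    weights[1:]: (C, C) maps applied per position (1x1 MLP layers)."""
    C, r, _ = weights[0].shape
    H, W = x.shape
    out = np.empty((H - r + 1, W - r + 1, C))
    for i in range(H - r + 1):
        for j in range(W - r + 1):
            patch = x[i:i + r, j:j + r]
            # First layer, Eq. (8a): one dot product per output channel.
            f = relu(np.tensordot(weights[0], patch, axes=2) + biases[0])
            for wm, bm in zip(weights[1:], biases[1:]):  # Eq. (8b)
                f = relu(wm @ f + bm)
            out[i, j] = f
    return out
```

With identity 1x1 layers this reduces to an ordinary convolution, which shows how the MLP generalizes the classic linear filter.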

Global Average Pooling. In a traditional ConvNet, the tensor is vectorized


and imported into fully connected layers and a softmax layer for a classification
task. However, fully connected layers are prone to overfitting, though dropout
[11] and its variation [10] have been proposed. A novel strategy, called global
average pooling (GAP), has been proposed to remove traditional fully connected
layers [18]. A GAP layer transforms the feature tensor W × H × P into a feature
vector 1 × 1 × P by taking the average of each feature map as follows:
x_k = ( ∑_{i=1}^{W} ∑_{j=1}^{H} f^m_{i,j,k} ) / (W · H),   k = 1, · · · , P.    (9)
The generated feature vector is used for classification. From the belief function perspective, the feature vector can be used for object representation and classified into one of M classes or rejected by a belief function classifier. Thus, a ConvNet can be regarded as a feature generator.
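Eq. (9) is a one-liner in practice; this sketch assumes the feature tensor is stored as a (W, H, P) array.

```python
import numpy as np

# Eq. (9): global average pooling collapses each W x H feature map into a
# single scalar, producing the P-dimensional vector fed to the classifier.

def global_average_pooling(t):
    """t: (W, H, P) feature tensor -> (P,) feature vector."""
    return t.mean(axis=(0, 1))
```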

3 ConvNet-BF Classifier
In this section, we present a method to combine a belief function classifier and a
ConvNet for object recognition allowing for ambiguous pattern rejection. The

architecture of the proposed method, called ConvNet-BF classifier, is illustrated


in Fig. 2. A ConvNet-BF classifier can be divided into three parts: a ConvNet as
a feature producer, a belief function classifier as a mass-function generator, and
a decision rule. In this classifier, input data are first imported into a ConvNet
with nonlinear convolutional layers and a global pooling layer to extract latent
features related to the class membership. The features are then imported into
a belief function classifier, in which they are converted into mass functions and
aggregated by Dempster’s rule. Finally, an evidence-theoretic rule is used for
pattern classification and rejection based on the aggregated mass functions. As
the background of the three parts has been introduced in Sect. 2, we only pro-
vide the details of the combination in this section, including the connectionist
implementation and the learning strategy.

3.1 Connectionist Implementation


In a ConvNet-BF classifier, the Euclidean distance between a feature vector and each prototype is first computed and then used to generate a mass function. To reduce the classification error when P is large, we assign a weight to each feature:

d^i = ( ∑_{k=1}^{P} w^i_k (x_k − p^i_k)^2 )^{1/2},    (10)

and the weights are normalized by introducing new parameters ζ^i_k (ζ^i_k ∈ R) as

w^i_k = (ζ^i_k)^2 / ∑_{l=1}^{P} (ζ^i_l)^2.    (11)
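Eqs. (10)-(11) can be checked with a few lines of NumPy; the squared-ζ parametrization guarantees non-negative weights summing to one for any real-valued ζ.

```python
import numpy as np

# Eqs. (10)-(11): squared parameters zeta are normalized into feature
# weights, which then enter the prototype distance. Toy values are ours.

def weighted_distance(x, p, zeta):
    w = zeta ** 2 / np.sum(zeta ** 2)          # Eq. (11): w >= 0, sums to 1
    return np.sqrt(np.sum(w * (x - p) ** 2))   # Eq. (10)
```

When all ζ are equal, every feature gets weight 1/P and the distance reduces to a scaled Euclidean distance.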

3.2 Learning
The proposed learning strategy to train a ConvNet-BF classifier consists of two parts: (a) an end-to-end training method to train the ConvNet and the belief function classifier simultaneously; (b) a data-driven method to select λ_0.

Fig. 2. Architecture of a ConvNet-BF classifier



End-to-End Training. Compared with the belief function classifier proposed in [6], we have different expressions for the derivatives w.r.t. w^i_k, ζ^i_k, and p^i_k in the new belief function classifier. A normalized error E_ν(x) is computed as:

E_ν(x) = (1/2N) ∑_{i=1}^{N} ∑_{q=1}^{M} (Pre_{ν,q,i} − Tar_{q,i})^2,    (12a)

Pre_{ν,q,i} = m'_{q,i} + ν m'_{M+1,i},    (12b)

m'_i = m_i / ∑_{k=1}^{M+1} m_{k,i}.    (12c)

Here, Tar_i = (Tar_{1,i}, · · · , Tar_{M,i}) and m_i = (m_i({ω1}), . . . , m_i({ωM}), m_i(Ω))^T are the target output vector and the unnormalized network output vector for pattern x_i, respectively, and m_{k,i} denotes the k-th component of m_i. We transform m'_i into a vector (Pre_{ν,1,i}, . . . , Pre_{ν,M,i}) by distributing a fraction ν of m'_{M+1,i} to each class, under the constraint 0 ≤ ν ≤ 1. The numbers Pre_{1,q,i}, Pre_{0,q,i}, and Pre_{1/M,q,i} represent, respectively, the credibility, the plausibility, and the pignistic probability of class ωq. The derivatives of E_ν(x) w.r.t. p^i_k, w^i_k, and ζ^i_k in a belief function classifier can be expressed as

∂E_ν(x)/∂p^i_k = (∂E_ν(x)/∂s^i) · (∂s^i/∂p^i_k) = (∂E_ν(x)/∂s^i) · 2(η^i)^2 s^i w^i_k (x_k − p^i_k),    (13)

∂E_ν(x)/∂w^i_k = (∂E_ν(x)/∂s^i) · (∂s^i/∂w^i_k) = −(∂E_ν(x)/∂s^i) · (η^i)^2 s^i (x_k − p^i_k)^2,    (14)

and

∂E_ν(x)/∂ζ^i_k = (∂E_ν(x)/∂w^i_k) · (∂w^i_k/∂ζ^i_k)    (15a)
= (2ζ^i_k / (∑_{l=1}^{P} (ζ^i_l)^2)^2) ( ∑_{l=1}^{P} (ζ^i_l)^2 · ∂E_ν(x)/∂w^i_k − ∑_{l=1}^{P} (ζ^i_l)^2 · ∂E_ν(x)/∂w^i_l ).    (15b)
Finally, the derivatives of the error w.r.t. x_k, w^m_{i,j,k}, and b^m_k in the last MLP are given as

∂E_ν(x)/∂x_k = ∑_{i=1}^{n} (∂E_ν(x)/∂s^i) · (∂s^i/∂x_k) = −∑_{i=1}^{n} (∂E_ν(x)/∂s^i) · 2(η^i)^2 s^i w^i_k (x_k − p^i_k),    (16)

∂E_ν(x)/∂w^m_{i,j,k} = (∂E_ν(x)/∂f^m_{i,j,k}) · (∂f^m_{i,j,k}/∂w^m_{i,j,k}) = w^m_{i,j,k} · (∂E_ν(x)/∂f^m_{i,j,k}),   k = 1, · · · , P,    (17)

and

∂E_ν(x)/∂b^m_k = (∂E_ν(x)/∂f^m_{i,j,k}) · (∂f^m_{i,j,k}/∂b^m_k) = ∂E_ν(x)/∂f^m_{i,j,k},   k = 1, · · · , P,    (18)

with

∂E_ν(x)/∂f^m_{i,j,k} = (∂E_ν(x)/∂x_k) · (∂x_k/∂f^m_{i,j,k}) = (1/(W · H)) · ∂E_ν(x)/∂x_k,   k = 1, · · · , P.    (19)

Here, w^m_{i,j,k} is a component of the weight matrix w^m_k, while f^m_{i,j,k} is a component of the vector f^m_{i,j} in Eq. (8).

Determination of λ_0. A data-driven method for determining λ_0, so as to guarantee a ConvNet-BF classifier with a given rejection rate, is shown in Fig. 3. We randomly select three fifths of the training set χ to train a ConvNet-BF classifier, while a random one fifth of the set is used as a validation set. The remaining one fifth of the set is used to draw a λ_0^(1)-rejection curve, from which the value of λ_0^(1) for a given rejection rate can be determined. We repeat the process and take the average of the λ_0^(i) as the final λ_0 for the desired rejection rate.

Fig. 3. Illustration of the procedure for determining λ0
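A minimal sketch of the λ_0 selection step, assuming the maximum credibility rule (reject when max_j m({ωj}) < 1 − λ_0): sweep a grid of candidate costs over held-out outputs and keep the one whose empirical rejection rate is closest to the target. The function name and the grid are our own choices, not the paper's.

```python
import numpy as np

# Pick lambda_0 from a held-out lambda_0-rejection curve.

def pick_lambda0(best_masses, target_rate, grid=np.linspace(0.0, 1.0, 101)):
    """best_masses: max_j m({w_j}) for each held-out pattern."""
    rates = np.array([np.mean(best_masses < 1.0 - lam) for lam in grid])
    return float(grid[int(np.argmin(np.abs(rates - target_rate)))])
```

Averaging the values returned over several random splits gives the final λ_0, as in the procedure of Fig. 3.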

4 Numerical Experiments
In this section, we evaluate ConvNet-BF classifiers on three benchmark datasets:
CIFAR-10 [13], CIFAR-100 [13], and MNIST [16]. To compare with traditional
ConvNets, the architectures and training strategies of the ConvNet parts in
ConvNet-BF classifiers are the same as those used in the study of Lin et al.,
called NIN [18]. Feature vectors from the ConvNet parts are imported into a
belief function classifier in our method, while they are directly injected into
softmax layers in NINs.
In order to make a fair comparison, a probability-based rejection rule is adopted for NINs: max_{j=1,··· ,M} p_j < 1 − λ_0, where p_j is the output probability of the NIN.

4.1 CIFAR-10
The CIFAR-10 dataset [13] is made up of 60,000 RGB images of size 32 × 32
partitioned in 10 classes. There are 50,000 training images, and we randomly
selected 10,000 images as validation data for the ConvNet-BF classifier. We
then randomly used 10,000 images of the training set to determine λ0 .

The test set error rates without rejection of the ConvNet-BF and NIN classi-
fiers are 9.46% and 9.21%, respectively. The difference is small but statistically
significant according to McNemar’s test (p-value: 0.012). Error rates without
rejection mean that we only consider max_{j=1,··· ,M} p_j and max_{j=1,··· ,M} m({ωj}).
If the selected class is not the correct one, we regard it as an error. It turns
out in our experiment that using a belief function classifier instead of a softmax
layer only slightly impacts the classifier performance.
The test set error rates with rejection of the two models are presented in Fig. 4a. A rejection decision is not regarded as an incorrect classification. When the rejection rate increases, the test set error decreases, which shows that the belief function classifier rejects a part of the incorrect classifications. However, the error decreases only slightly when the rejection rate is higher than 7.5%. This demonstrates that the belief function classifier rejects more and more correctly classified patterns as the rejection rate increases. Thus, a satisfactory λ_0^(i) should be determined to guarantee that the ConvNet-BF classifier has a desirable accuracy rate and a low correct-rejection rate. Additionally, compared with the NIN, the ConvNet-BF classifier rejects significantly more incorrectly classified patterns. For example, the p-value of McNemar's test for the difference of error rates between the two classifiers at a 5.0% rejection rate is close to 0. We can conclude that a belief function classifier with an evidence-theoretic rejection rule is more suitable for making a decision allowing for pattern rejection than a softmax layer with the probability-based rejection rule.
Table 1 presents the confusion matrix of the ConvNet-BF classifier with the
maximum credibility rule, whose rejection rate is 5.0%. The ConvNet-BF clas-
sifier tends to select rejection when there are two or more similar patterns,
such as dog and cat, which can lead to incorrect classification. In the view of
evidence theory, the ConvNet part provides conflicting evidence when two or
more similar patterns exist. The maximally conflicting evidence corresponds to
m ({ωi }) = m ({ωj }) = 0.5 [7]. Additionally, the additional mass function m (Ω)
provides the possibility to verify whether the model is well trained because we
have m (Ω) = 1 when the ConvNet part cannot provide any useful evidence.

4.2 CIFAR-100
The CIFAR-100 dataset [13] has the same size and format as the CIFAR-10 dataset, but it contains 100 classes, so the number of images per class is only one tenth of that in CIFAR-10. For CIFAR-100, we also randomly selected 10,000 images of the training
set to determine λ0 . The ConvNet-BF and NIN classifiers achieved, respectively,
40.62% and 39.24% test set error rates without rejection, a small but statisti-
cally significant difference (p-value: 0.014). Similarly to CIFAR-10, it turns out
that the belief function classifier has a similar error rate as a network with a
softmax layer. Figure 4b shows the test set error rates with rejection for the two
models. Compared with the rejection performance in CIFAR-10, the ConvNet-
BF classifier rejects more incorrect classification results. We can conclude that
the evidence-theoretic classifier still performs well when the classification task is
difficult and the training set is not adequate. Similarly, Table 2 shows that the

Fig. 4. Rejection-error curves: CIFAR-10 (a), CIFAR-100 (b), and MNIST (c)

Table 1. Confusion matrix for CIFAR-10.

Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck
Airplane - 0.03 0.03 0.01 0.02 0.05 0.04 0.01 0.04 0.05
Automobile 0 - 0.04 0.04 0.08 0.08 0.04 0.06 0.03 0.07
Bird 0.02 0.04 - 0.05 0.04 0.07 0.03 0.08 0 0.04
Cat 0.02 0.03 0.13 - 0.06 0.44 0.11 0.04 0.05 0.06
Deer 0.01 0.04 0.07 0.12 - 0.03 0.12 0.34 0.04 0.08
Dog 0.02 0.03 0.05 0.49 0.11 - 0.06 0.09 0.01 0.04
Frog 0.02 0.04 0.08 0.06 0.12 0.06 - 0.06 0.06 0.05
Horse 0.01 0.02 0.04 0.06 0.31 0.10 0.04 - 0.04 0.04
Ship 0.04 0.05 0.02 0.04 0.12 0.05 0.04 0.18 - 0.02
Truck 0.02 0 0.06 0.09 0.03 0.06 0.07 0.06 0.04 -
Rejection 0.20 0.13 0.14 1.05 0.84 1.07 0.14 1.14 0.18 0.11

ConvNet-BF classifier tends to select the rejection action when two classes are
similar, in which case we have m ({ωi }) ≈ m ({ωj }). In contrast, the classifier
tends to produce m(Ω) ≈ 1 when the model is not trained well because of an
inadequate training set.

4.3 MNIST
The MNIST database of handwritten digits consists of a training set of 60,000
examples and a test set of 10,000 examples. The training strategy for the
ConvNet-BF classifier was the same as the strategy in CIFAR-10 and CIFAR-
100. The test set error rates without rejection of the two models are close (0.88% and 0.82%), and the difference is only weakly significant (p-value: 0.077). Again, using a belief function classifier instead of a softmax layer introduced no negative effect on the network for MNIST. The test set error rates with rejection of the two models are shown
in Fig. 4c. The ConvNet-BF classifier rejects a small number of classification
results because the feature vectors provided by the ConvNet part include little
confusing information.

Table 2. Confusion matrix for the superclass flowers.

Orchids Poppies Roses Sunflowers Tulips


Orchids - 0.24 0.23 0.28 0.15
Poppies 0.14 - 0.43 0.10 0.90
Roses 0.27 0.12 - 0.16 0.13
Sunflowers 0.18 0.15 0.12 - 0.22
Tulips 0.08 1.07 0.76 0.17 -
Rejection 0.09 0.37 0.63 0.12 0.34

5 Conclusion
In this work, we proposed a novel classifier based on ConvNet and DS theory for
object recognition allowing for ambiguous pattern rejection, called “ConvNet-BF
classifier”. This new structure consists of a ConvNet with nonlinear convolutional
layers and a global pooling layer to extract high-dimensional features and a belief
function classifier to convert the features into Dempster-Shafer mass functions.
The mass functions can be used for classification or rejection based on evidence-
theoretic rules. Additionally, the novel classifier can be trained in an end-to-end
way.
The use of belief function classifiers in ConvNets had no negative effect on the
classification performances on the CIFAR-10, CIFAR-100, and MNIST datasets.
The combination of belief function classifiers and ConvNets can reduce the errors
by rejecting a part of the incorrect classifications. This provides a new direction
to improve the performance of deep learning for object recognition. The classifier

is prone to assign a rejection action when there are conflicting features, which
easily yield incorrect classifications in traditional ConvNets. In addition, the
proposed method opens a way to explain the relationship between the extracted
features in convolutional layers and class membership of each pattern. The mass
m(Ω) assigned to the set of classes provides the possibility to verify whether a
ConvNet is well trained or not.

References
1. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn.
2(1), 1–127 (2009)
2. Bi, Y.: The impact of diversity on the accuracy of evidential classifier ensembles.
Int. J. Approximate Reasoning 53(4), 584–607 (2012)
3. Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping.
In: Yager, R.R., Liu, L. (eds.) Classic Works of the Dempster-Shafer Theory of
Belief Functions. STUDFUZZ, vol. 219, pp. 57–72. Springer, Heidelberg (2008).
https://doi.org/10.1007/978-3-540-44792-4_3
4. Denœux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer
theory. IEEE Trans. Syst. Man Cybern. 25(5), 804–813 (1995)
5. Denœux, T.: Analysis of evidence-theoretic decision rules for pattern classification.
Pattern Recogn. 30(7), 1095–1107 (1997)
6. Denœux, T.: A neural network classifier based on Dempster-Shafer theory. IEEE
Trans. Syst. Man Cybern. Part A Syst. Hum. 30(2), 131–150 (2000)
7. Denœux, T.: Logistic regression, neural networks and Dempster-Shafer theory: a
new perspective. Knowl.-Based Syst. 176, 54–67 (2019)
8. Denœux, T., Dubois, D., Prade, H.: Representations of uncertainty in artificial
intelligence: beyond probability and possibility. In: Marquis, P., Papini, O., Prade,
H. (eds.) A Guided Tour of Artificial Intelligence Research, Chap. 4. Springer
(2019)
9. Denœux, T., Kanjanatarakul, O., Sriboonchitta, S.: A new evidential K-nearest
neighbor rule based on contextual discounting with partially supervised learning.
Int. J. Approximate Reasoning 113, 287–302 (2019)
10. Gomez, A.N., Zhang, I., Swersky, K., Gal, Y., Hinton, G.E.: Targeted dropout. In:
CDNNRIA Workshop at the 32nd Conference on Neural Information Processing
Systems (NeurIPS 2018), Montréal (2018)
11. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors. arXiv
preprint arXiv:1207.0580 (2012)
12. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing,
Doha, pp. 1746–1751 (2014)
13. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images.
University of Toronto, Technical report (2009)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017)
15. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
16. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning
applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

17. Leng, B., Liu, Y., Yu, K., Zhang, X., Xiong, Z.: 3D object understanding with 3D
convolutional neural networks. Inf. Sci. 366, 188–201 (2016)
18. Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on
Learning Representations (ICLR 2014), Banff, pp. 1–10 (2014)
19. Liu, Z., Pan, Q., Dezert, J., Han, J.W., He, Y.: Classifier fusion with contextual
reliability evaluation. IEEE Trans. Cybern. 48(5), 1605–1618 (2018)
20. Minary, P., Pichon, F., Mercier, D., Lefevre, E., Droit, B.: Face pixel detection
using evidential calibration and fusion. Int. J. Approximate Reasoning 91, 202–
215 (2017)
21. Sakaguchi, K., Post, M., Van Durme, B.: Efficient elicitation of annotations for
human evaluation of machine translation. In: Proceedings of the Ninth Workshop
on Statistical Machine Translation, Baltimore, pp. 1–11 (2014)
22. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: Artificial Intelligence
and Statistics, Florida, pp. 448–455 (2009)
23. Salakhutdinov, R., Tenenbaum, J.B., Torralba, A.: Learning with hierarchical-deep
models. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1958–1971 (2012)
24. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
25. Tong, Z., Gao, J., Zhang, H.: Recognition, location, measurement, and 3D recon-
struction of concealed cracks using convolutional neural networks. Constr. Build.
Mater. 146, 775–787 (2017)
26. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing
robust features with denoising autoencoders. In: Proceedings of the 25th Interna-
tional Conference on Machine Learning, pp. 1096–1103, New York (2008)
27. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denois-
ing autoencoders: learning useful representations in a deep network with a local
denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)
28. Xu, P., Davoine, F., Zha, H., Denœux, T.: Evidential calibration of binary SVM
classifiers. Int. J. Approximate Reasoning 72, 55–70 (2016)
29. Yager, R.R., Liu, L.: Classic Works of the Dempster-Shafer Theory of Belief Func-
tions, vol. 219. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-44792-4
On Learning Evidential Contextual Corrections
from Soft Labels Using a Measure
of Discrepancy Between Contour Functions

Siti Mutmainah1,2(B), Samir Hachour1, Frédéric Pichon1, and David Mercier1


1 EA 3926 LGI2A, Univ. Artois, 62400 Béthune, France
[email protected], {samir.hachour,frederic.pichon,david.mercier}@univ-artois.fr
2 UIN Sunan Kalijaga, Yogyakarta, Indonesia
[email protected]

Abstract. In this paper, a proposition is made to learn the parameters of evi-


dential contextual correction mechanisms from a learning set composed of soft
labelled data, that is data where the true class of each object is only partially
known. The method consists in optimizing a measure of discrepancy between the
values of the corrected contour function and the ground truth also represented
by a contour function. The advantages of this method are illustrated by tests on
synthetic and real data.

Keywords: Belief functions · Contextual corrections · Learning · Soft labels

1 Introduction
In Dempster-Shafer theory [15, 17], the correction of a source of information, a sen-
sor for example, is classically done using the discounting operation introduced by
Shafer [15], but also by so-called contextual correction mechanisms [10, 13] taking into
account more refined knowledge about the quality of a source.
These mechanisms, called contextual discounting, negating and reinforcement [13],
can be derived from the notions of reliability (or relevance), which concerns the com-
petence of a source to answer the question of interest, and truthfulness [12, 13] indicat-
ing the source’s ability to say what it knows (it may also be linked with the notion of
bias of a source). The contextual discounting is an extension of the discounting opera-
tion, which corresponds to a partially reliable and totally truthful source. The contex-
tual negating is an extension of the negating operation [12, 13], which corresponds to
the case of a totally reliable but partially truthful source, the extreme case being the
negation of a source [5]. At last, the contextual reinforcement is an extension of the
reinforcement, a dual operation of the discounting [11, 13].
In this paper, the problem of learning the parameters of these correction mechanisms
from soft labels, meaning partially labelled data, is tackled. More specifically, in our
case, soft labels indicate the true class of each object in an imprecise manner through a
contour function.

c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 382–389, 2019.
https://doi.org/10.1007/978-3-030-35514-2_28

A method for learning these corrections from labelled data (hard labels), where
the truth is perfectly known for each element of the learning set, has already been intro-
duced in [13]. It consists in minimizing a measure of discrepancy between the corrected
contour functions and the ground truths over elements of a learning set. In this paper,
it is shown that this same measure can be used to learn from soft labels, and tests on
synthetic and real data illustrate its advantages to (1) improve a classifier even if the
data is only partially labelled; and (2) obtain better performances than learning these
corrections from approximate hard labels approaching the only available soft labels.
This paper is organized as follows. In Sect. 2, the basic concepts and notations used
in this paper are presented. Then, in Sect. 3, the three applied contextual corrections
as well as their learning from hard labels are exposed. The proposition to extend this
method to soft labels is introduced. Tests of this method on synthetic and real data are
presented in Sect. 4. At last, a discussion and future works are given in Sect. 5.

2 Belief Functions: Basic Concepts Used


Only the basic concepts used are presented in this section (See for example [3, 15, 17]
for further details on the belief function framework).
From a frame of discernment Ω = {ω1, ..., ωK}, a mass function (MF), noted m^Ω
or m if no ambiguity, is defined from 2^Ω to [0, 1], and verifies Σ_{A⊆Ω} m^Ω(A) = 1.
The focal elements of a MF m are the subsets A of Ω such that m(A) > 0.
A MF m is in one-to-one correspondence with a plausibility function Pl defined for
all A ⊆ Ω by

Pl(A) = Σ_{B∩A≠∅} m(B). (1)

The contour function pl of a MF m is defined for all ω ∈ Ω by

pl : Ω → [0, 1], ω ↦ pl(ω) = Pl({ω}). (2)
It is the restriction of the plausibility function to all the singletons of Ω.
The knowledge of the reliability of a source is classically taken into account by
the operation called discounting [15, 16]. Let us suppose a source S provides a piece
of information represented by a MF mS. With β ∈ [0, 1] the degree of belief in the
reliability of the source, the discounting of mS is defined by the MF m s.t.

m(A) = β mS (A) + (1 − β)mΩ (A) , (3)

for all A ⊆ Ω, where mΩ represents the total ignorance, i.e. the MF defined by
mΩ (Ω) = 1.
Several justifications for this mechanism can be found in [10, 13, 16].
The contour function of the MF m resulting from the discounting (3) is defined for
all ω ∈ Ω by (see for example [13, Prop. 11])

pl(ω) = 1 − (1 − plS (ω))β , (4)

with plS the contour function of mS .
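As an illustration of these definitions, the following Python sketch (ours; frozensets encode subsets of Ω, and all names are our own) implements Eqs. (1)–(3) and checks the relation (4) between discounting and the contour function.

```python
def plausibility(m, A):
    """Pl(A) = sum of m(B) over focal sets B with B ∩ A ≠ ∅ (Eq. 1)."""
    return sum(v for B, v in m.items() if B & A)

def contour(m, omega):
    """Contour function pl(ω) = Pl({ω}) (Eq. 2)."""
    return plausibility(m, frozenset({omega}))

def discount(m, beta, Omega):
    """Shafer's discounting (Eq. 3): mass (1 − β) moves to total ignorance."""
    out = {B: beta * v for B, v in m.items()}
    out[Omega] = out.get(Omega, 0.0) + (1.0 - beta)
    return out

Omega = frozenset({"a", "b", "c"})
m_S = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.3, Omega: 0.1}
m = discount(m_S, 0.8, Omega)
# Eq. (4): pl(ω) = 1 − β(1 − pl_S(ω))
for w in Omega:
    assert abs(contour(m, w) - (1 - 0.8 * (1 - contour(m_S, w)))) < 1e-12
```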



3 Contextual Corrections and Learning from Labelled Data


In this Section, the contextual corrections we used are first exposed, then their learning
from hard labels. The proposition to extend this method to soft labels is then introduced.

3.1 Contextual Corrections of a Mass Function


For the sake of simplicity, we only recall here the expressions of the contour functions
resulting from the application of the contextual discounting, reinforcement and negating
mechanisms in the case of K contexts, where K is the number of elements of Ω.
It is shown in [13] that these expressions are rich enough to minimize the discrep-
ancy measure used to learn the parameters of these corrections, this measure being
presented in Sect. 3.2.
Let us suppose a source S providing a piece of information mS .
The contour function resulting from the contextual discounting (CD) of mS and a
set of contexts composed of the singletons of Ω is given by
pl(ω) = 1 − (1 − plS (ω))β{ω} , (5)
for all ω ∈ Ω, with the K parameters β{ω} which may vary in [0, 1].
For the contextual reinforcement (CR) and the contextual negating (CN), the contour
functions are respectively given, from a set of contexts composed of the complementary
of each singleton of Ω, by
pl(ω) = plS (ω)β{ω} , (6)
and
pl(ω) = 0.5 + (plS (ω) − 0.5)(2β{ω} − 1) , (7)
for all ω ∈ Ω, with the K parameters β{ω} able to vary in [0, 1].
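A direct transcription of Eqs. (5)–(7) in Python (our sketch; class labels and variable names are ours). We read the juxtaposition of plS(ω) and β{ω} in these equations as a product, which is precisely what makes the learning problem of Sect. 3.2 linear in β.

```python
def cd(pl_S, beta):
    """Contextual discounting, Eq. (5)."""
    return {w: 1 - (1 - pl_S[w]) * beta[w] for w in pl_S}

def cr(pl_S, beta):
    """Contextual reinforcement, Eq. (6)."""
    return {w: pl_S[w] * beta[w] for w in pl_S}

def cn(pl_S, beta):
    """Contextual negating, Eq. (7)."""
    return {w: 0.5 + (pl_S[w] - 0.5) * (2 * beta[w] - 1) for w in pl_S}

pl_S = {"a": 1.0, "b": 0.4, "c": 0.1}
neutral = {w: 1.0 for w in pl_S}
# With β_{ω} = 1 for every ω, each correction leaves the contour unchanged.
for corr in (cd, cr, cn):
    out = corr(pl_S, neutral)
    assert all(abs(out[w] - pl_S[w]) < 1e-12 for w in pl_S)
```

At the other extreme, β_{ω} = 0 yields the vacuous contour for CD (everything fully plausible), the zero contour for CR, and the "negated" contour 1 − plS(ω) for CN.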

3.2 Learning from Hard Labels


Let us suppose a source of information providing a MF mS concerning the true class of
an object among a set of possible classes Ω.
If we have a learning set composed of n instances (or objects) the true values of
which are known, we can learn the parameters of a correction by minimizing a discrep-
ancy measure between the output of the classifier which is corrected (a correction is
applied to mS ) and the ground truth [7, 10, 13].
Introduced in [10], the following measure Epl yields a simple optimization problem
(a linear least-squares optimization problem, see [13, Prop. 12, 14 and 16]) to learn the
vectors β^CD, β^CR and β^CN composed of the K parameters of the corrections CD, CR
and CN:

Epl(β) = Σ_{i=1}^{n} Σ_{k=1}^{K} (pli(ωk) − δi,k)², (8)
where pli is the contour function regarding the class of the instance i resulting from
a contextual correction (CD, CR or CN) of the MF provided by the source for this
instance, and δi,k is the indicator function of the truth of all the instances i ∈ {1, . . . , n},
i.e. δi,k = 1 if the class of the instance i is ωk , otherwise δi,k = 0.

3.3 Learning from Soft Labels

In this paper, we consider the case where the truth is no longer given precisely by the
values δi,k, but only in an imprecise manner by a contour function δ̃i s.t.

δ̃i : Ω → [0, 1], ωk ↦ δ̃i(ωk) = δ̃i,k. (9)

The contour function δ̃i gives information about the true class in Ω of the instance i.
Knowing then the truth only partially, we propose to learn the correction parameters
using the following discrepancy measure Ẽpl, extending directly (8):

Ẽpl(β) = Σ_{i=1}^{n} Σ_{k=1}^{K} (pli(ωk) − δ̃i,k)². (10)

The discrepancy measure Ẽpl also yields, for each correction (CD, CR and CN), a linear
least-squares optimization problem. For example, for CD, Ẽpl can be written as

Ẽpl(β) = ‖Qβ − d̃‖² (11)

with

Q = [diag(pl1 − 1); …; diag(pln − 1)], d̃ = [δ̃1 − 1; …; δ̃n − 1] (12)

(the blocks being stacked vertically),

where diag(v) is a square diagonal matrix whose diagonal is composed of the elements
of the vector v, and where for all i ∈ {1, . . . , n}, δ̃i is the column vector composed of
the values of the contour function δ̃i , meaning δ̃i = (δ̃i,1 , . . . , δ̃i,K )T .
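Because Q in Eq. (12) is built from diagonal blocks, the least-squares problem (11) decouples into one scalar problem per class, each β{ωk} admitting a closed form. The sketch below (ours) exploits this for the CD correction; it clips the solution into [0, 1], which is a simplification of the constrained solve.

```python
def learn_beta_cd(pl_S, soft_labels):
    """Learn β for CD from soft labels (Eq. 10/11).

    pl_S, soft_labels: lists of length-K lists, one row per instance.
    Per class k: β_k = Σ_i (pl_i,k − 1)(δ̃_i,k − 1) / Σ_i (pl_i,k − 1)².
    """
    K = len(pl_S[0])
    beta = []
    for k in range(K):
        num = sum((pl[k] - 1.0) * (d[k] - 1.0) for pl, d in zip(pl_S, soft_labels))
        den = sum((pl[k] - 1.0) ** 2 for pl in pl_S)
        b = num / den if den > 0 else 1.0
        beta.append(min(1.0, max(0.0, b)))   # clip into [0, 1] (simplification)
    return beta

pl_S = [[1.0, 0.4, 0.1], [0.2, 1.0, 0.3]]    # toy source contours (ours)
soft = [[1.0, 0.2, 0.2], [0.1, 1.0, 0.4]]    # toy soft labels (ours)
beta = learn_beta_cd(pl_S, soft)
# Corrected contours, Eq. (5): pl(ω_k) = 1 − (1 − pl_S(ω_k)) β_{ω_k}
corrected = [[1 - (1 - pl[k]) * beta[k] for k in range(3)] for pl in pl_S]
```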
In the following, this learning proposition is tested with generated and real data.

4 Tests on Generated and Real Data

We first expose how soft labels can be generated from hard labels to make the tests
exposed afterwards on synthetic and real data.

4.1 Generating Soft Labels from Hard Labels

It is not easy to find partially labelled data in the literature. Thus, as in [1, 8, 9, 14], we
have built our partially labelled data sets (soft labels) from perfect truths (hard labels)
using the procedure described in Algorithm 1 (where Beta, B, and U denote respectively
the Beta, Bernoulli and uniform distributions).

Algorithm 1. Soft labels generation

Input: hard labels δi with i ∈ {1, . . . , n}, where for each i, the integer k ∈ {1, . . . , K} s.t.
δi,k = 1 is denoted by ki.
Output: soft labels δ̃i with i ∈ {1, . . . , n}.
1: procedure HardToSoftLabels
2:   for each instance i do
3:     Draw pi ∼ Beta(μ = .5, v = .04)
4:     Draw bi ∼ B(pi)
5:     if bi = 1 then
6:       Draw ki ∼ U{1,...,K}
7:     δ̃i,ki ← 1
8:     δ̃i,k ← pi for all k ≠ ki

Algorithm 1 allows one to obtain soft labels that are all the more imprecise as the
most plausible class is false.
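A possible Python implementation of Algorithm 1 (ours). We assume steps 7–8 apply to every instance (otherwise δ̃i would be undefined when bi = 0), we index classes from 0, and we convert (μ, v) into the usual Beta shape parameters, giving a = b = 2.625.

```python
import random

def hard_to_soft_labels(hard, K, rng=random):
    """hard: list of true class indices in {0, ..., K-1}; returns soft labels."""
    # Beta(mu=0.5, v=0.04) as shape parameters: a = b = mu*(mu*(1-mu)/v - 1) = 2.625
    a = b = 0.5 * (0.5 * 0.5 / 0.04 - 1.0)
    soft = []
    for k_i in hard:                 # k_i: index of the true class of instance i
        p_i = rng.betavariate(a, b)
        if rng.random() < p_i:       # b_i ~ Bernoulli(p_i): possibly swap the label
            k_i = rng.randrange(K)
        delta = [p_i] * K            # delta_i,k = p_i for all k != k_i
        delta[k_i] = 1.0             # the (possibly swapped) class stays fully plausible
        soft.append(delta)
    return soft

soft = hard_to_soft_labels([0, 2, 1], K=3)
```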

4.2 Tests Performed


The evidential classifier chosen as a source of information is the evidential k-nearest
neighbor classifier (EkNN) introduced by Denœux in [2], with k = 3. Another classifier
with other settings could have been chosen: it can be seen as a black box.
The first test set we consider is composed of synthetic data with 3 classes
built from 3 bivariate normal distributions with respective means μω1 = (1, 2), μω2 =
(2, 1) and μω3 = (0, 0), and a common covariance matrix Σ s.t.

Σ = [1 0.5; 0.5 1]. (13)
For each class, 100 instances have been generated. They are illustrated in Fig. 1.

Fig. 1. Illustration of the generated dataset (3 classes, 2 attributes).

We have then considered several real data sets from the UCI database [6] composed
of numerical attributes, as the EkNN classifier is used. These data sets are described in
Table 1.

Table 1. Characteristics of the UCI dataset used (number of instances without missing data,
number of classes, number of numerical attributes used)

Data #Instances #Classes #Attributes


Ionosphere 350 2 34
Iris 150 3 4
Sonar 208 2 60
Vowel 990 11 9
Wine 178 3 13

For each dataset, a 10-repeated 10-fold cross validation has been undertaken as
follows:
– the group containing one tenth of the data is considered as the test set (the instances
labels being made imprecise using Algorithm 1),
– the other 9 groups form the learning set, which is randomly divided into two groups
of equal size:
• one group to learn the EkNN classifier (learnt from hard truths),
• one group to learn the parameters of the correction mechanisms from soft labels
(the labels of the dataset are made imprecise using Algorithm 1).
For learning the parameters of contextual corrections, two strategies are compared.
1. In the first strategy, we use the optimization of Eq. (8) from the hard labels closest
to the available soft labels (the most plausible class is chosen). Corrections with this
strategy are denoted by CD, CR and CN.
2. In the second strategy, Eq. (10) is directly optimized from soft labels (cf Sect. 3.3).
The resulting corrections using this second strategy are denoted by CDsl, CRsl and
CNsl.
The performances of the systems (the classifier alone and the corrections - CD, CR
or CN - of this classifier according to the two strategies described above) are measured
using Ẽpl (10), where δ̃ represents the partially known truth. This measure corresponds
to the sum over the test instances of the differences, in the least squares sense, between
the truths being sought and the system outputs.
The performances Ẽpl (10) obtained from UCI and generated data for the classi-
fier and its corrections are summed up in Table 2 for each type of correction. Standard
deviations are indicated in brackets.
From the results presented in Table 2, we can remark that, for CD, the second strat-
egy (CDsl) consisting in learning directly from the soft labels, allows one to obtain
lower differences Ẽpl from the truth on the test set than the first strategy (CD) where
the correction parameters are learnt from approximate hard labels. We can also remark
that this strategy yields lower differences Ẽpl than the classifier alone, illustrating, in
these experiments, the usefulness of soft labels even if hard labels are not available,
which can be interesting in some applications.
The same conclusions can be drawn for CN.

Table 2. Performances Ẽpl obtained for the classifier alone and the classifier corrected with CD,
CR and CN using both strategies. Standard deviations are indicated in brackets.

Data EkNN CD CDsl CR CRsl CN CNsl


Generated data 23.8 (3.8) 16.6 (2.8) 7.9 (1.5) 26.8 (3.0) 23.5 (3.7) 11.5 (1.6) 9.8 (0.6)
Ionosphere 16.2 (2.5) 9.6 (2.2) 5.3 (1.0) 17.2 (1.9) 15.9 (2.3) 9.3 (1.3) 8.4 (0.9)
Iris 12.5 (2.4) 8.4 (2.1) 3.3 (0.9) 13.1 (2.0) 12.3 (2.2) 6.7 (1.5) 4.8 (0.5)
Sonar 7.8 (2.0) 6.3 (1.9) 3.5 (0.9) 9.0 (1.6) 7.7 (1.9) 5.1 (0.8) 5.0 (0.9)
Vowel 279 (24) 278 (23) 62 (5) 310 (21) 279 (24) 240 (21) 65 (5)
Wine 13.3 (2.6) 10.4 (2.3) 4.3 (1.0) 15.0 (2.1) 13.3 (2.5) 7.2 (1.6) 5.7 (0.6)

For CR, the second strategy is also better than the first one, but we can note that,
unlike the other corrections, there is no improvement for the first strategy in comparison
with the classifier alone (the second strategy also having performances close to those of
the classifier alone).

5 Discussion and Future Works


We have shown that contextual corrections may lead to improved performances in the
sense of measure Ẽpl , which relies on the plausibility values returned by the systems for
each class for each instance. We also note that by using the same experiments as those in
Sect. 4.2 but evaluating the performances using a simple 0–1 error criterion, where for
each instance the most plausible class is compared to the true class, the performances
remain globally identical for the classifier alone as well as all the corrections (the most
plausible class being often the same for the classifier and each correction).
For future works, we are considering the use of other performance measures, which
would also take fully into account the uncertainty and the imprecision of the outputs.
For example, we would like to study those introduced by Zaffalon et al. [18].
It would also be possible to test other classifiers than the EkNN. We could also test
the advantage of these correction mechanisms in classifiers fusion problems.
At last, we also intend to investigate the learning from soft labels using another
measure than Ẽpl and in particular the evidential likelihood introduced by Denœux [4]
and already used to develop a CD-based EkNN [9].

Acknowledgement. The authors would like to thank the anonymous reviewers for their helpful
and constructive comments, which have helped them to improve the quality of the paper and to
consider new paths for future research.
Mrs. Mutmainah’s research is supported by the overseas 5000 Doctors program of Indonesian
Religious Affairs Ministry (MORA French Scholarship).

References
1. Côme, E., Oukhellou, L., Denœux, T., Aknin, P.: Learning from partially supervised data
using mixture models and belief functions. Pattern Recogn. 42(3), 334–348 (2009)

2. Denoeux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE
Trans. Syst. Man Cybern. 25(5), 804–813 (1995)
3. Denœux, T.: Conjunctive and disjunctive combination of belief functions induced by nondis-
tinct bodies of evidence. Artif. Intell. 172, 234–264 (2008)
4. Denœux, T.: Maximum likelihood estimation from uncertain data in the belief function
framework. IEEE Trans. Knowl. Data Eng. 25(1), 119–130 (2013)
5. Dubois, D., Prade, H.: A set-theoretic view of belief functions: logical operations and approx-
imations by fuzzy sets. Int. J. Gen. Syst. 12(3), 193–226 (1986)
6. Dua, D., Graff, C.: UCI Machine Learning Repository. School of Information and Computer
Science, University of California, Irvine (2019). http://archive.ics.uci.edu/ml
7. Elouedi, Z., Mellouli, K., Smets, P.: The evaluation of sensors’ reliability and their tuning
for multisensor data fusion within the transferable belief model. In: Benferhat, S., Besnard,
P. (eds.) ECSQARU 2001. LNCS (LNAI), vol. 2143, pp. 350–361. Springer, Heidelberg
(2001). https://doi.org/10.1007/3-540-44652-4_31
8. Kanjanatarakul, O., Kuson, S., Denoeux, T.: An evidential K-nearest neighbor classifier
based on contextual discounting and likelihood maximization. In: Destercke, S., Denoeux,
T., Cuzzolin, F., Martin, A. (eds.) BELIEF 2018. LNCS (LNAI), vol. 11069, pp. 155–162.
Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99383-6_20
9. Kanjanatarakul, O., Kuson, S., Denœux, T.: A new evidential k-nearest neighbor rule based
on contextual discounting with partially supervised learning. Int. J. Approx. Reason. 113,
287–302 (2019)
10. Mercier, D., Quost, B., Denœux, T.: Refined modeling of sensor reliability in the belief func-
tion framework using contextual discounting. Inf. Fusion 9(2), 246–258 (2008)
11. Mercier, D., Lefèvre, E., Delmotte, F.: Belief functions contextual discounting and canonical
decompositions. Int. J. Approx. Reason. 53(2), 146–158 (2012)
12. Pichon, F., Dubois, D., Denoeux, T.: Relevance and truthfulness in information correction
and fusion. Int. J. Approx. Reason. 53(2), 159–175 (2012)
13. Pichon, F., Mercier, D., Lefèvre, E., Delmotte, F.: Proposition and learning of some belief
function contextual correction mechanisms. Int. J. Approx. Reason. 72, 4–42 (2016)
14. Quost, B., Denoeux, T., Li, S.: Parametric classification with soft labels using the eviden-
tial EM algorithm: linear discriminant analysis versus logistic regression. Adv. Data Anal.
Classif. 11(4), 659–690 (2017)
15. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton
(1976)
16. Smets, P.: Belief functions: the disjunctive rule of combination and the generalized Bayesian
theorem. Int. J. Approx. Reason. 9(1), 1–35 (1993)
17. Smets, P., Kennes, R.: The transferable belief model. Artif. Intell. 66(2), 191–234 (1994)
18. Zaffalon, M., Corani, G., Mauá, D.D.: Evaluating credal classifiers by utility-discounted
predictive accuracy. Int. J. Approx. Reason. 53(8), 1282–1301 (2012)
Efficient Möbius Transformations
and Their Applications to D-S Theory

Maxime Chaveroche(B), Franck Davoine, and Véronique Cherfaoui

Alliance Sorbonne Université, Université de Technologie de Compiègne, CNRS,


Laboratoire Heudiasyc, 57 Avenue de Landshut, 60200 Compiègne, France
{maxime.chaveroche,franck.davoine,veronique.cherfaoui}@hds.utc.fr

Abstract. Dempster-Shafer Theory (DST) generalizes Bayesian prob-


ability theory, offering useful additional information, but suffers from a
high computational burden. A lot of work has been done to reduce the
complexity of computations used in information fusion with Dempster’s
rule. The main approaches exploit either the structure of Boolean lat-
tices or the information contained in belief sources. Each has its merits
depending on the situation. In this paper, we propose sequences of graphs
for the computation of the zeta and Möbius transformations that opti-
mally exploit both the structure of distributive lattices and the infor-
mation contained in belief sources. We call them the Efficient Möbius
Transformations (EMT). We show that the complexity of the EMT is
always inferior to the complexity of algorithms that consider the whole
lattice, such as the Fast Möbius Transform (FMT) for all DST trans-
formations. We then explain how to use them to fuse two belief sources.
More generally, our EMTs apply to any function in any finite distributive
lattice, focusing on a meet-closed or join-closed subset.

Keywords: Zeta transform · Möbius transform · Distributive lattice ·


Meet-closed subset · Join-closed subset · Fast Möbius Transform ·
FMT · Dempster-Shafer Theory · DST · Belief functions · Efficiency ·
Information-based · Complexity reduction

1 Introduction
Dempster-Shafer Theory (DST) [11] is an elegant formalism that generalizes
Bayesian probability theory. It is more expressive by giving the possibility for
a source to represent its belief in the state of a variable not only by assigning
credit directly to a possible state (strong evidence) but also by assigning credit to
any subset (weaker evidence) of the set Ω of all possible states. This assignment
of credit is called a mass function and provides meta-information to quantify
This work was carried out and co-funded in the framework of the Labex MS2T and
the Hauts-de-France region of France. It was supported by the French Government,
through the program “Investments for the future” managed by the National Agency
for Research (Reference ANR-11-IDEX-0004-02).
c Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 390–403, 2019.
https://doi.org/10.1007/978-3-030-35514-2_29

the level of uncertainty about one’s believes considering the way one established
them, which is critical for decision making.
Nevertheless, this information comes with a cost: considering 2^|Ω| potential
values instead of only |Ω| can lead to computationally and spatially expensive
algorithms. They can become difficult to use for more than a dozen possible states
(e.g. 20 states in Ω generate more than a million subsets), although we may need
to consider large frames of discernment (e.g. for classification or identification).
Moreover, these algorithms not being tractable anymore beyond a few dozen
states means their performances greatly degrade before that, which further limits
their application to real-time applications. To tackle this issue, a lot of work has
been done to reduce the complexity of transformations used to combine belief
sources with Dempster’s rule [6]. We distinguish between two approaches that
we call powerset-based and evidence-based.
The powerset-based approach concerns all algorithms based on the structure
of the powerset 2^Ω of the frame of discernment Ω. They have a complexity
dependent on |Ω|. Early works [1,7,12,13] proposed optimizations by restricting
the structure of evidence to only singletons and their negation, which greatly
restrains the expressiveness of the DST. Later, a family of optimal algorithms
working in the general case, i.e. the ones based on the Fast Möbius Transform
(FMT) [9], was discovered. Their complexity is O(|Ω|·2^|Ω|) in time and O(2^|Ω|)
in space. It has become the de facto standard for the computation of every
transformation in DST. Consequently, efforts were made to reduce the size of Ω
to benefit from the optimal algorithms of the FMT. More specifically, [14] refers
to the process of conditioning by the combined core (intersection of the unions of
all focal sets of each belief source) and lossless coarsening (merging of elements
of Ω which always appear together in focal sets). Also, Monte Carlo methods
[14] have been proposed but depend on a number of trials that must be large
and grows with |Ω|, in addition to not being exact.
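For reference, the FMT itself is short: with subsets of Ω encoded as bitmasks, the zeta transform (subset sums) and its Möbius inverse each run in O(|Ω|·2^|Ω|) time. A standard sketch (ours):

```python
def zeta(m, n):
    """Zeta transform: b[A] = sum of m[B] over all B ⊆ A, for A in 2^Ω, |Ω| = n."""
    f = list(m)
    for i in range(n):                 # one pass per element of Omega
        for A in range(1 << n):
            if A & (1 << i):           # fold in the value of A without element i
                f[A] += f[A ^ (1 << i)]
    return f

def moebius(b, n):
    """Möbius transform: exact inverse of zeta, same structure with subtraction."""
    f = list(b)
    for i in range(n):
        for A in range(1 << n):
            if A & (1 << i):
                f[A] -= f[A ^ (1 << i)]
    return f

n = 3
m = [0] * (1 << n)
m[0b001], m[0b011], m[0b111] = 5, 3, 2    # integer masses for exact arithmetic
b = zeta(m, n)
assert b[0b011] == 8 and b[0b111] == 10
assert moebius(b, n) == m
```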
The evidence-based approach concerns all algorithms that aim to reduce the
computations to the only subsets that contain information (evidence), called
focal sets and usually far less numerous than 2^|Ω|. This approach, also referred to
as the obvious one, implicitly originates from the seminal work of Shafer [11]
and is often more efficient than the powerset-based one since it only depends
on information contained in sources in a quadratic way. Doing so, it allows
for the exploitation of the full potential of DST by enabling us to choose any
frame of discernment, without concern about its size. Moreover, the evidence-
based approach benefits directly from the use of approximation methods, some
of which are very efficient [10]. Therefore, this approach seems superior to the
FMT in most use cases, above all when |Ω| is large, where an algorithm with
exponential complexity is just intractable.
It is also possible to easily find evidence-based algorithms computing all
DST transformations, except for the conjunctive and disjunctive decompositions
for which we recently proposed a method [4].
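As an example of the evidence-based style (our sketch), the commonality q(A) = Σ_{B⊇A} m(B) restricted to the focal sets can be computed with O(|F|²) set operations, never touching the remaining subsets of 2^Ω:

```python
def commonalities_on_focal_sets(m):
    """q(A) = sum of m(B) over focal sets B ⊇ A, computed only for focal A.

    m: dict mapping frozensets (focal sets) to masses; |F| = len(m), so the
    loop below performs O(|F|^2) subset tests instead of visiting 2^|Omega| sets.
    """
    return {A: sum(v for B, v in m.items() if A <= B) for A in m}

m = {frozenset({"a"}): 0.5,
     frozenset({"a", "b"}): 0.3,
     frozenset({"a", "b", "c"}): 0.2}
q = commonalities_on_focal_sets(m)
```

Here frozenset's `<=` is the subset test, so each q(A) only inspects the |F| focal sets, regardless of the size of the frame.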
392 M. Chaveroche et al.

However, since these algorithms rely only on the information contained
in sources, they do not exploit the structure of the powerset to reduce the
complexity, leading to situations in which the FMT can be more efficient if
almost every subset contains information, i.e. if the number of focal sets tends
towards 2^|Ω| [14], all the more when no approximation method is employed.
In this paper, we fuse these two approaches into one, proposing new sequences
of graphs, in the same fashion as the FMT, that are always more efficient than
the FMT and can in addition benefit from evidence-based optimizations. We
call them the Efficient Möbius Transformations (EMT). More generally, our
approach applies to any function defined on a finite distributive lattice.
Outside the scope of DST, [2] is related to our approach in the sense that
we both try to remove redundancy in the computation of the zeta and Möbius
transforms on the subset lattice 2Ω . However, they only consider the redundancy
of computing the image of a subset that is known to be null beforehand. To do
so, they only visit sets that are accessible from the focal sets of lowest rank
by successive unions with each element of Ω. Here, we demonstrate that it is
possible to avoid far more computations by reducing them to specific sets so that
each image is only computed once. These sets are the focal points described in
[4]. The study of their properties will be carried out in depth in an upcoming
article [5]. Besides, our method is more general since it applies to any finite
distributive lattice.
Furthermore, an important result of our work resides in the optimal compu-
tation of the zeta and Möbius transforms in any intersection-closed family F of
sets from 2Ω , i.e. with a complexity O(|Ω|.|F |). Indeed, in the work of [3] on
the optimal computation of these transforms in any finite lattice L, they embed-
ded L into the Boolean lattice 2Ω , obtaining an intersection-closed family F as
its equivalent, and found a meta-procedure building a circuit of size O(|Ω|.|F |)
computing the zeta and Möbius transforms. However, they did not manage to
build this circuit in less than O(|Ω|·2^|Ω|). Given F, our Theorem 2 in this paper
directly computes this circuit in O(|Ω|·|F|), while being much simpler.
This paper is organized as follows: Sect. 2 will present the elements on which
our method is built. Section 3 will present our EMT. Section 4 will discuss their
complexity and their usage in DST. Finally, we will conclude this article with
Sect. 5.

2 Background of Our Method


Let (P, ≤) be a finite¹ set partially ordered by ≤.
Zeta Transform. The zeta transform g : P → R of a function f : P → R is
defined as follows:

∀y ∈ P, g(y) = ∑_{x ≤ y} f(x)

¹ The following definitions hold for lower semifinite partially ordered sets as well, i.e.
partially ordered sets such that the number of elements of P lower in the sense of ≤
than another element of P is finite. But for the sake of simplicity, we will only talk
of finite partially ordered sets.
Efficient Möbius Transformations and Their Applications to D-S Theory 393

For example, the commonality function q (resp. the implicability function b) in


DST is the zeta transform of the mass function m for (2Ω , ⊇) (resp. (2Ω , ⊆)).
Möbius Transform. The Möbius transform of g is f. It is defined as follows:

∀y ∈ P, f(y) = ∑_{x ≤ y} g(x)·μ(x, y)     (1)

where μ is the Möbius function of P.


There is also a multiplicative version with the same properties in which the
sum is replaced by a product. An example of this version would be the inverse
of the conjunctive (resp. disjunctive) weight function in DST which is the mul-
tiplicative Möbius transform of the commonality (resp. implicability) function.
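As a concrete reading of these paired definitions, here is a direct quadratic sketch in Python (our own illustration, not from the paper): the zeta transform is evaluated literally, and the Möbius transform is recovered along a linear extension of the order, which is equivalent to the μ-based formulation of Eq. (1).

```python
def zeta_transform(P, leq, f):
    """Direct O(|P|^2) zeta transform on a finite poset:
    g(y) = sum of f(x) over all x <= y."""
    return {y: sum(f[x] for x in P if leq(x, y)) for y in P}

def moebius_transform(P, leq, g):
    """Inverse of the zeta transform: f(y) = g(y) - sum_{x < y} f(x),
    computed along a linear extension of the order (elements with
    fewer predecessors first)."""
    f = {}
    for y in sorted(P, key=lambda e: sum(leq(x, e) for x in P)):
        f[y] = g[y] - sum(f[x] for x in P if leq(x, y) and x != y)
    return f
```

With P = 2^Ω and leq(x, y) defined as x ⊇ y, zeta_transform yields the commonality function of a mass function, per the example above.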

2.1 Sequence of Graphs and Computation of the Zeta Transform

Consider a procedure A : (R^P, G_{P,≤}, {+, −, ·, /}) → R^P, where R^P is the set
of functions of domain P and range R, and G_{P,≤} is the set of acyclic directed
graphs in which every node is in P and every arrow is a pair (x, y) ∈ P² such that
x ≤ y. For any such function m and graph G, the procedure A(m, G, +) outputs
a function z such that, for every y ∈ P, z(y) is the sum of every m(x) where (x, y)
is an arrow of G. We define its reverse procedure as A(z, G, −), which outputs
the function m′ such that, for every y ∈ P, m′(y) is the sum, for every arrow
(x, y) of G, of z(x) if x = y, and −z(x) otherwise. If the arrows of G represent all
pairs of P ordered by ≤, then A(m, G, +) computes the zeta transform z of m.
Note however that A(z, G, −) does not output the Möbius transform m of z. For
that, G has to be broken down into a sequence of subgraphs (e.g. one subgraph
per rank of y, in order of increasing rank).
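A minimal Python rendering of this procedure (our own sketch; the paper defines A abstractly) makes its behavior concrete: arrows are pairs (x, y), identity arrows included, and the '−' mode adds z(y) while subtracting z(x) for the non-identity arrows into y.

```python
def A(values, G, sign):
    """Sketch of the procedure A(m, G, +/-) of Sect. 2.1. `values` maps
    each node of P to a number; `G` is a list of arrows (x, y) with
    x <= y, identity arrows (y, y) included."""
    out = {y: 0 for y in values}
    for (x, y) in G:
        if sign == '+' or x == y:
            out[y] += values[x]   # '+' mode, or identity arrow in '-' mode
        else:
            out[y] -= values[x]   # non-identity arrow in '-' mode
    return out
```

With G the full set of ordered pairs of (2^Ω, ⊆), A(m, G, '+') is indeed the zeta transform, while A(z, G, '−') is not its inverse, illustrating the remark above.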
Moreover, the upper bound complexity of these procedures, if G represents
all pairs of P ordered by ≤, is O(|P|²). Yet, it is known that the optimal upper
bound complexity of the computation of the zeta and Möbius transforms if P
is a finite lattice is O(|∨I(P)|·|P|) (see [3]). Thus, a decomposition of these
procedures should lead to a lower complexity at least in this case.
For this, Theorem 3 of [9] defines a necessary and sufficient condition to verify
that A(A(. . . (A(m, H1 , +), . . . ), Hk−1 , +), Hk , +) = A(m, G≤ , +), where Hi is the
i-th directed acyclic graph of a sequence H of size k, and G≤ = {(x, y) ∈ P 2 /x ≤
y}. For short, it is said in [9] that H computes the Möbius transformation of G≤ .
Here, in order to dissipate any confusion, we will say instead that H computes
the zeta transformation of G≤ .
It is stated in our terms as follows: H computes the zeta transformation of
G≤ if and only if every arrow from each Hi is in G≤ and every arrow g from G≤
can be decomposed as a unique path (g1 , g2 , . . . , g|H| ) ∈ H1 × H2 × · · · × H|H| ,
i.e. such that the tail of g is the one of g1 , the head of g is the one of g|H| , and
∀i ∈ {1, . . . , |H| − 1}, the head of gi is the tail of gi+1 .
[Figure: four rows of nodes, each spanning the subsets ∅, {a}, {b}, {a, b}, {c},
{a, c}, {b, c}, Ω; between consecutive rows, H1 joins each x to x ∪ {a}, H2 to
x ∪ {b}, and H3 to x ∪ {c}.]

Fig. 1. Illustration representing the arrows contained in the sequence H computing the
zeta transformation of G⊆ = {(X, Y) ∈ 2^Ω × 2^Ω / X ⊆ Y}, where Ω = {a, b, c}. For the
sake of clarity, identity arrows are not displayed. This representation is derived from
the one used in [9].

Application to the Boolean Lattice 2^Ω (FMT). Let Ω = {ω1, ω2, . . . , ωn}.
The sequence H of graphs Hi computes the zeta transformation of G⊆ =
{(X, Y) ∈ 2^Ω × 2^Ω / X ⊆ Y} if:

Hi = {(X, Y) ∈ 2^Ω × 2^Ω / Y = X or Y = X ∪ {ωi}},

where i ∈ {1, . . . , n}. Figure 1 illustrates the sequence H.
Dually, the sequence H̄ of graphs H̄i computes the zeta transformation of
G⊇ = {(X, Y) ∈ 2^Ω × 2^Ω / X ⊇ Y} if:

H̄i = {(X, Y) ∈ 2^Ω × 2^Ω / X = Y or X = Y ∪ {ωi}}.

The sequences of graphs H and H̄ are the foundation of the FMT algorithms.
Their execution is O(n·2^n) in time and O(2^n) in space.
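A bitmask sketch of these sequences in Python (our own rendering, not the paper's code): each pass i applies the graph Hi in place, and the subtracting passes recover m, as detailed in Sect. 2.2.

```python
def fmt_zeta_subsets(m):
    """FMT zeta passes over 2^Omega: returns b with b[Y] = sum of m[X]
    for X subset of Y. `m` is a list of length 2^n indexed by bitmask.
    O(n * 2^n) time, O(2^n) space."""
    b = list(m)
    n = (len(m) - 1).bit_length()
    for i in range(n):          # one pass per element omega_i (graph H_i)
        bit = 1 << i
        for y in range(len(b)):
            if y & bit:         # arrow from y \ {omega_i} to y
                b[y] += b[y ^ bit]
    return b

def fmt_moebius_subsets(b):
    """Reverse passes: recovers the Möbius transform m from b."""
    m = list(b)
    n = (len(b) - 1).bit_length()
    for i in range(n):
        bit = 1 << i
        for y in range(len(m)):
            if y & bit:
                m[y] -= m[y ^ bit]
    return m
```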

2.2 Sequence of Graphs and Computation of the Möbius Transform

Now, consider that we have a sequence H computing the zeta transformation of
G≤. It is easy to see that the procedure A(. . . (A(A(z, Hk, −), Hk−1, −), . . . ), H1, −)
deconstructs z = A(A(. . . (A(m, H1, +), . . . ), Hk−1, +), Hk, +), revisiting every
arrow in H, as required to compute the Möbius transformation. But, to actually
compute the Möbius transformation and get m back with H and A, we have to
make sure that the images of z that we add through A do not bear redundancies
(e.g. if H is the sequence that only contains G≤, then H does compute the
Möbius transformation of G≤ with Eq. 1, but not with A). For this, we only
have to check that for each arrow (x, y) in G≤, there exists at most one path
(g1, . . . , gp) ∈ Hi1 × · · · × Hip where p ∈ N∗ and ∀j ∈ {1, . . . , p − 1}, 1 ≤ ij ≤
ij+1 ≤ ij + 1 ≤ |H| and either tail(gj) = head(gj) or ij−1 < ij < ij+1 (i.e. which
moves right or down in Fig. 1). With this, we know that we do not subtract two
images z1 and z2 from the same z3 if one of z1 and z2 is supposed to be subtracted
from the other beforehand. In the end, it is easy to see that, if for each graph
Hi, every element y ∈ P such that (x, y) ∈ Hi and (y, y′) ∈ Hi where x ≠ y verifies
y′ = y (i.e. no “horizontal” path of more than one arrow in each Hi), then
the condition is already satisfied by the one of Sect. 2.1. So, if this condition is
satisfied, we will say that H computes the Möbius transformation of G≤ .
Application to the Boolean Lattice 2^Ω (FMT). Resuming the application
of Sect. 2.1, for all X ∈ 2^Ω, if ωi ∉ X, then there is an arrow (X, Y) in Hi where
Y = X ∪ {ωi} and X ≠ Y, but then for any set Y′ such that (Y, Y′) ∈ Hi, we
have Y′ = Y ∪ {ωi} = Y. Conversely, if ωi ∈ X, then the arrow (X, X ∪ {ωi})
is in Hi, but its head and tail are equal. Thus, H also computes the Möbius
transformation of G⊆.

2.3 Order Theory


Irreducible Elements. We note ∨I(P) the set of join-irreducible elements of
P, i.e. the elements i such that i ≠ ⋀P for which it holds that ∀x, y ∈ P, if x < i
and y < i, then x ∨ y < i. Dually, we note ∧I(P) the set of meet-irreducible
elements of P, i.e. the elements i such that i ≠ ⋁P for which it holds that
∀x, y ∈ P, if x > i and y > i, then x ∧ y > i. For example, in the Boolean lattice
2^Ω, the join-irreducible elements are the singletons {ω}, where ω ∈ Ω.
If P is a finite lattice, then every element of P is the join of join-irreducible
elements and the meet of meet-irreducible elements.
Support of a Function in P. The support supp(f) of a function f : P → R
is defined as supp(f) = {x ∈ P / f(x) ≠ 0}.
For example, in DST, the set of focal elements of a mass function m is supp(m).

2.4 Focal Points

For any function f : P → R, we note ∧supp(f) (resp. ∨supp(f)) the smallest
meet-closed (resp. join-closed) subset of P containing supp(f), i.e.:

∧supp(f) = {x / ∃S ⊆ supp(f), S ≠ ∅, x = ⋀_{s∈S} s}

∨supp(f) = {x / ∃S ⊆ supp(f), S ≠ ∅, x = ⋁_{s∈S} s}

The set of focal points F̊ of a mass function m from [4] for the conjunctive weight
function is ∧supp(m). For the disjunctive one, it is ∨supp(m).
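For P = 2^Ω with sets represented as frozensets, both closures can be built incrementally; a sketch with our own naming. Meeting each new focal set with everything already closed suffices, since (s ∩ c1) ∩ (s ∩ c2) = s ∩ (c1 ∩ c2) is already covered.

```python
def meet_closure(supp):
    """Focal points for the conjunctive side: the smallest
    intersection-closed family of frozensets containing `supp`."""
    closed = set()
    for s in supp:
        closed |= {s & c for c in closed}  # meet s with everything closed
        closed.add(s)
    return closed

def join_closure(supp):
    """Dual closure under union, i.e. the focal points for the
    disjunctive side."""
    closed = set()
    for s in supp:
        closed |= {s | c for c in closed}
        closed.add(s)
    return closed
```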
It has been proven in [4] that the image of 2^Ω through the conjunctive weight
function can be computed without redundancies by only considering the focal
points ∧supp(m) in the definition of the multiplicative Möbius transform of
the commonality function. The image of every set in 2^Ω \ ∧supp(m) through the
conjunctive weight function is 1. The same can be stated for the disjunctive
weight function regarding the implicability function and ∨supp(m). In the same
way, the image of any set in 2^Ω \ ∧supp(m) through the commonality function
is only a duplicate of the image of a set in ∧supp(m) and can be recovered by
searching for its smallest superset in ∧supp(m). In fact, as generalized in an
upcoming article [5], for any function f : P → R, ∧supp(f) is sufficient to define
its zeta and Möbius transforms based on the partial order ≥, and ∨ supp(f ) is
sufficient to define its zeta and Möbius transforms based on the partial order ≤.
However, considering the case where P is a finite lattice, naive algorithms
that only consider ∧supp(f) or ∨supp(f) have upper bound complexities in
O(|∧supp(f)|²) or O(|∨supp(f)|²), which may be worse than the optimal
complexity O(|∨I(P)|·|P|) for a procedure that considers the whole lattice P. In this
paper, we propose algorithms with complexities always less than O(|∨I(P)|·|P|)
computing the image of a meet-closed (e.g. ∧supp(f)) or join-closed (e.g.
∨supp(f)) subset of P through the zeta or Möbius transform, provided that
P is a finite distributive lattice.

3 Our Efficient Möbius Transformations


In this section, we consider a function f : P → R where P is a finite distribu-
tive lattice (e.g. the Boolean lattice 2Ω ). We present here our Efficient Möbius
Transformations as Theorems 1 and 2. The first one describes a way of com-
puting the zeta and Möbius transforms of a function based on the smallest sub-
lattice L supp(f ) of P containing both ∧ supp(f ) and ∨ supp(f ), which is defined
in Proposition 2. The second one goes beyond this optimization by computing
these transforms based only on ∧supp(f) or ∨supp(f). Nevertheless, this second
approach requires the direct computation of ∧supp(f) or ∨supp(f), which has an
upper bound complexity of O(|supp(f)|·|∧supp(f)|) or O(|supp(f)|·|∨supp(f)|),
which may be more than O(|∨I(P)|·|P|) if |supp(f)| ≫ |∨I(P)|.

Lemma 1 (Safe join). Let us consider a finite distributive lattice L. For all
i ∈ ∨I(L) and for all x, y ∈ L such that i ≰ x and i ≰ y, we have i ≰ x ∨ y.

Proof. By definition of a join-irreducible element, we know that ∀i ∈ ∨I(L)
and for all a, b ∈ L, if a < i and b < i, then a ∨ b < i. Moreover, for all x, y ∈ L
such that i ≰ x and i ≰ y, we have equivalently i ∧ x < i and i ∧ y < i. Thus, we
get that (i ∧ x) ∨ (i ∧ y) < i. Since L satisfies the distributive law, this implies
that (i ∧ x) ∨ (i ∧ y) = i ∧ (x ∨ y) < i, which means that i ≰ x ∨ y.

Proposition 1 (Iota elements of subsets of P). For any S ⊆ P, the join-
irreducible elements of the smallest sublattice LS of P containing S are:

ι(S) = { ⋀{s ∈ S / s ≥ i} / i ∈ ∨I(P) and ∃s ∈ S, s ≥ i }.

Proof. First, it can be easily shown that the meet of any two elements of ι(S)
is either ⋀S or in ι(S). Then, suppose that we generate LS with the joins of
elements of ι(S), to which we add the element ⋀S. Then, since P is distributive,
we have that for all x, y ∈ LS, their meet x ∧ y is either ⋀S or equal to the join
of every meet of pairs (iS,x, iS,y) ∈ ι(S)², where iS,x ≤ x and iS,y ≤ y. Thus,
x ∧ y ∈ LS, which implies that LS is a sublattice of P. In addition, notice that
for each nonzero element s ∈ S and for all i ∈ ∨I(P) such that s ≥ i, we also
have by construction s ≥ iS ≥ i, where iS = ⋀{s′ ∈ S / s′ ≥ i}. Therefore, we
have s = ⋁{i ∈ ∨I(P) / s ≥ i} = ⋁{i ∈ ι(S) / s ≥ i}, i.e. s ∈ LS. Besides,
if ⋀P ∈ S, then it is equal to ⋀S, which is also in LS by construction. So,
S ⊆ LS. It follows that the meet or join of every nonempty subset of S is in
LS, i.e. MS ⊆ LS and JS ⊆ LS, where MS is the smallest meet-closed subset
of P containing S and JS is the smallest join-closed subset of P containing S.
Furthermore, ι(S) ⊆ MS, which means that we cannot build a smaller sublattice
of P containing S. Therefore, LS is the smallest sublattice of P containing S.
Finally, for any i ∈ ∨I(P) such that ∃s ∈ S, s ≥ i, we note iS = ⋀{s ∈
S / s ≥ i}. For all x, y ∈ LS, if iS > x and iS > y, then by construction of
ι(S), we have i ≰ x and i ≰ y (otherwise, iS would be less than x or y), which
implies by Lemma 1 that i ≰ x ∨ y. Since i ≤ iS, we have necessarily iS > x ∨ y.
Therefore, iS is a join-irreducible element of LS.
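In the Boolean case P = 2^Ω, where the join-irreducible elements are the singletons, ι(S) admits a direct computation; a Python sketch under that assumption (our own naming):

```python
def iota(S):
    """iota(S) for P = 2^Omega: for each omega covered by some set of S,
    intersect all sets of S containing omega. The result is the set of
    join-irreducible elements of the smallest sublattice containing S."""
    out = set()
    for w in frozenset().union(*S):
        out.add(frozenset.intersection(*(s for s in S if w in s)))
    return out
```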
Proposition 2 (Lattice support). The smallest sublattice of P containing
both ∧supp(f) and ∨supp(f), noted L supp(f), can be defined as:

L supp(f) = { ⋁X / X ⊆ ι(supp(f)), X ≠ ∅ } ∪ { ⋀supp(f) }.

More specifically, ∨supp(f) is contained in the upper closure L,↑ supp(f) of
supp(f) in L supp(f):

L,↑ supp(f) = {x ∈ L supp(f) / ∃s ∈ supp(f), s ≤ x},

and ∧supp(f) is contained in the lower closure L,↓ supp(f) of supp(f) in
L supp(f):

L,↓ supp(f) = {x ∈ L supp(f) / ∃s ∈ supp(f), s ≥ x}.

These sets can be computed in less than respectively O(|ι(supp(f))|·|L,↑ supp(f)|)
and O(|ι(supp(f))|·|L,↓ supp(f)|), which is at most O(|∨I(P)|·|P|).

Proof. The proof is immediate here, considering Proposition 1 and its proof.
In addition, since ∧supp(f) only contains meets of elements of supp(f), every
element of ∧supp(f) is less than at least one element of supp(f). Similarly, since
∨supp(f) only contains joins of elements of supp(f), every element of ∨supp(f)
is greater than at least one element of supp(f). Hence L,↓ supp(f) and L,↑ supp(f).
As pointed out in [8], a special ordering of the join-irreducible elements of a
lattice when using the Fast Zeta Transform [3] leads to the optimal computation
of its zeta and Möbius transforms. Here, we use this ordering to build our EMT
for finite distributive lattices in a way similar to [8] but without the need to check
the equality of the decompositions into the first j join-irreducible elements at
each step.
Corollary 1 (Join-irreducible ordering). Let us consider a finite distribu-
tive lattice L and let its join-irreducible elements ∨I(L) be ordered such that
∀ik, il ∈ ∨I(L), k < l ⇒ ik ≱ il. We note ∨I(L)k = {i1, . . . , ik−1, ik}.
For all elements ik ∈ ∨I(L), we have ik ≰ ⋁∨I(L)k−1.
If L is a graded lattice (i.e. a lattice equipped with a rank function ρ : L → N),
then ρ(i1) ≤ ρ(i2) ≤ · · · ≤ ρ(i|∨I(L)|) implies this ordering. For example, in DST,
P = 2^Ω, so for all A ∈ P, ρ(A) = |A|.
[Figure: five rows of nodes, each spanning the lattice elements ∅, {a}, {d}, {a, d},
{c, d, f}, {a, c, d, f}, Ω; between consecutive rows, H1 joins each x to x ∪ Ω, H2 to
x ∪ {c, d, f}, H3 to x ∪ {d}, and H4 to x ∪ {a}.]

Fig. 2. Illustration representing the arrows contained in the sequence H when com-
puting the zeta transformation of G⊆ = {(x, y) ∈ L² / x ⊆ y}, where L =
{∅, {a}, {d}, {a, d}, {c, d, f}, {a, c, d, f}, Ω} with Ω = {a, b, c, d, e, f} and ∨I(L) =
{{a}, {d}, {c, d, f}, Ω}. For the sake of clarity, identity arrows are not displayed.

Proof. Since the join-irreducible elements are ordered such that ∀ik, il ∈ ∨I(L),
k < l ⇒ ik ≱ il, it is trivial to see that for any il ∈ ∨I(L) and ik ∈ ∨I(L)l−1,
we have ik ≱ il. Then, using Lemma 1 by recurrence, it is easy to get that
il ≰ ⋁∨I(L)l−1.

Theorem 1 (Efficient Möbius Transformation in a distributive lat-
tice). Let us consider a finite distributive lattice L (such as L supp(f)) and
let its join-irreducible elements ∨I(L) be ordered such that ∀ik, il ∈ ∨I(L),
k < l ⇒ ik ≱ il. We note n = |∨I(L)|.
The sequence H of graphs Hk computes the zeta and Möbius transformations
of G≤ = {(x, y) ∈ L² / x ≤ y} if:

Hk = {(x, y) ∈ L² / y = x or y = x ∨ ik̄},

where k̄ = n + 1 − k. This sequence is illustrated in Fig. 2. Its execution is
O(n·|L|).
Dually, the sequence H̄ of graphs H̄k computes the zeta and Möbius trans-
formations of G≥ = {(x, y) ∈ L² / x ≥ y} if:

H̄k = {(x, y) ∈ L² / x = y or x = y ∨ ik}.

Proof. By definition, for all k and ∀(x, y) ∈ Hk, we have x, y ∈ L and x ≤ y,
i.e. (x, y) ∈ G≤. Reciprocally, ∀(x, y) ∈ G≤, we have x ≤ y, which can be
decomposed as a unique path (g1, g2, . . . , gn) ∈ H1 × H2 × · · · × Hn:
Similarly to the FMT, the sequence H builds unique paths simply by gen-
erating the whole lattice step by step with each join-irreducible element of L.
However, unlike the FMT, the join-irreducible elements of L are not necessarily
atoms. Doing so, pairs of join-irreducible elements may be ordered, causing the
sequence H to skip or double some elements. And even if all the join-irreducible
elements of L are atoms, since L is not necessarily a Boolean lattice, the join of
two atoms may be greater than a third atom (e.g. if L is the diamond lattice),
leading to the same issue. Indeed, to build a unique path between two elements
x, y of L such that x ≤ y, we start from x. Then at step 1, we get to the join x ∨ in
if in ≤ y (we stay at x otherwise, i.e. identity arrow), then we get to x ∨ in ∨ in−1
if in−1 ≤ y, and so on until we get to y. However, if we have in ≤ x ∨ in−1, with
in ≰ x, then there are at least two paths from x to y: one passing by the join
with in at step 1 and one passing by the identity arrow instead.
More generally, this kind of issue may only appear if there is a k where
ik ≤ x ∨ ⋁∨I(L)k−1 with ik ≰ x, where ∨I(L)k−1 = {ik−1, ik−2, . . . , i1}. But,
since L is a finite distributive lattice, and since its join-irreducible elements are
ordered such that ∀ij, il ∈ ∨I(L), j < l ⇒ ij ≱ il, we have by Corollary 1
that ik ≰ ⋁∨I(L)k−1. So, if ik ≰ x, then by Lemma 1, we also have ik ≰
x ∨ ⋁∨I(L)k−1. Thereby, there is a unique path from x to y, meaning that the
condition of Sect. 2.1 is satisfied. H computes the zeta transformation of G≤.
Also, ∀x ∈ L, if ik̄ ≰ x, then there is an arrow (x, y) in Hk where y = x ∨ ik̄
and x ≠ y, but then for any element y′ such that (y, y′) ∈ Hk, we have y′ =
y ∨ ik̄ = y. Conversely, if ik̄ ≤ x, then the arrow (x, x ∨ ik̄) is in Hk, but its head
and tail are equal. Thus, the condition of Sect. 2.2 is satisfied. H also computes
the Möbius transformation of G≤.
Finally, to obtain H̄, we only need to reverse the paths of H, i.e. reverse the
arrows in each Hk and reverse the sequence of join-irreducible elements.
The procedure described in Theorem 1 to compute the zeta and Möbius
transforms of a function on P is always less than O(|∨I(P)|·|P|). Its upper
bound complexity for the distributive lattice L = L supp(f) is O(|∨I(L)|·|L|),
which is actually the optimal one for a lattice.
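A compact Python sketch of the sequences of Theorem 1 for a lattice of sets (our own rendering, under the assumption that `lattice` is closed under union and intersection and contains its join-irreducibles): the zeta pass takes the join-irreducible elements by decreasing rank (ik̄), the Möbius pass reverses them. Within a pass, sources (i not below x) and targets (above i) are disjoint, so in-place updates are safe.

```python
def emt_zeta(lattice, irr, g0):
    """Theorem 1, zeta pass w.r.t. subset order. `lattice` is a
    collection of frozensets closed under union/intersection, `irr` its
    join-irreducible elements, `g0` a dict of images."""
    g = dict(g0)
    for i in sorted(irr, key=len, reverse=True):  # largest rank first
        for x in lattice:
            if not i <= x:        # non-identity arrow (x, x | i)
                g[x | i] += g[x]  # x | i is in the lattice by closure
    return g

def emt_moebius(lattice, irr, g):
    """Theorem 1, Möbius pass: reversed paths, smallest rank first."""
    f = dict(g)
    for i in sorted(irr, key=len):
        for x in lattice:
            if not i <= x:
                f[x | i] -= f[x]
    return f
```

On the lattice of Fig. 2, each pass touches |L| elements per join-irreducible, i.e. O(|∨I(L)|·|L|) in total.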
Yet, we can reduce this complexity even further if we have ∧supp(f) or
∨supp(f). This is the motivation behind the procedure described in the follow-
ing Theorem 2. As a matter of fact, [3] proposed a meta-procedure producing
an algorithm that computes the zeta and Möbius transforms in an arbitrary
intersection-closed family F of sets of 2^Ω with a circuit of size O(|Ω|·|F|). How-
ever, this meta-procedure is O(|Ω|·2^|Ω|). Here, Theorem 2 provides a procedure
that directly computes the zeta and Möbius transforms with the optimal com-
plexity O(|Ω|.|F |), while being much simpler. Besides, our method is far more
general since it has the potential (depending on data structure) to reach this
complexity in any meet-closed subset of a finite distributive lattice.
Theorem 2 (Efficient Möbius Transformation in a join-closed or
meet-closed subset of P). Let us consider a meet-closed subset M of P (such
as ∧supp(f)). Also, let the join-irreducible elements ι(M) be ordered such that
∀ik, il ∈ ι(M), k < l ⇒ ik ≱ il.
The sequence H^M of graphs H^M_k computes the zeta and Möbius transforma-
tions of G^M_≥ = {(x, y) ∈ M² / x ≥ y} if:

H^M_k = { (x, y) ∈ M² / x = y
          or ( x = ⋀{s ∈ M / s ≥ y ∨ ik} and y ∨ ⋁ι(M)k ≥ x ) },
[Figure: five rows of nodes, each spanning the elements ∅, {a}, {b}, {a, b}, {c},
{d}, {b, c, d}, Ω; between consecutive rows, H^M_1 sends each y to the proxy of
y ∪ {a}, H^M_2 of y ∪ {b}, H^M_3 of y ∪ {c}, and H^M_4 of y ∪ {d}.]

Fig. 3. Illustration representing the arrows contained in the sequence H^M when
computing the zeta transformation of G^M_⊇ = {(x, y) ∈ M² / x ⊇ y}, where
M = {∅, {a}, {b}, {a, b}, {c}, {d}, {b, c, d}, Ω} with Ω = {a, b, c, d} and ι(M) =
{{a}, {b}, {c}, {d}}. For the sake of clarity, identity arrows are not displayed.

where ι(M)k = {i1, i2, . . . , ik}. This sequence is illustrated in Fig. 3. Its execution
is O(|ι(M)|·|M|·ℓ), where ℓ represents the number of operations required to obtain
the proxy element ⋀{s ∈ M / s ≥ y ∨ ik} of x. It can be as low as 1 operation².
Dually, the expression of H̄^M follows the same pattern, simply reversing the
paths of H^M by reversing the arrows in each H^M_k and reversing the sequence of
join-irreducible elements.
Similarly, if P is a Boolean lattice, then the dual H^J of this sequence H^M
of graphs computes the zeta and Möbius transformations of G^J_≤ = {(x, y) ∈
J² / x ≤ y}, where J is a join-closed subset of P (such as ∨supp(f)). Let the
meet-irreducible elements ι(J) of the smallest sublattice of P containing J be
ordered such that ∀ik, il ∈ ι(J), k < l ⇒ ik ≰ il. We have:

H^J_k = { (x, y) ∈ J² / x = y
          or ( x = ⋁{s ∈ J / s ≤ y ∧ ik} and y ∧ ⋀ι(J)k ≤ x ) },

where ι(J)k = {i1, i2, . . . , ik}.
Dually, the expression of H̄^J follows the same pattern, simply reversing the
paths of H^J by reversing the arrows in each H^J_k and reversing the sequence of
meet-irreducible elements.

Proof. By definition, for all k and ∀(x, y) ∈ H^M_k, we have x, y ∈ M and x ≥ y,
i.e. (x, y) ∈ G^M_≥. Reciprocally, ∀(x, y) ∈ G^M_≥, we have x ≥ y, which can be
decomposed as a unique path (g1, g2, . . . , g|ι(M)|) ∈ H^M_1 × H^M_2 × · · · × H^M_|ι(M)|:

² This unit cost can be obtained when P = 2^Ω using a dynamic binary tree as data
structure for the representation of M. With it, finding the proxy element only takes
the reading of a binary string, considered as one operation. Further details will soon
be available in an extended version of this work [5].
The idea is that we use the same procedure as in Theorem 1 that builds unique
paths simply by generating all elements of a finite distributive lattice L based on
the join of its join-irreducible elements step by step, as if we had M ⊆ L, except
that we remove all elements that are not in M. Doing so, the only difference
is that the join y ∨ ik of an element y of M with a join-irreducible ik ∈ ι(M)
of this hypothetical lattice L is not necessarily in M. However, thanks to the
meet-closure of M and to the synchronizing condition y ∨ ⋁ι(M)k ≥ p, we can
“jump the gap” between two elements y and p of M separated by elements of
L\M and maintain the unicity of the path between any two elements x and y
of M. Indeed, for all join-irreducible elements ik ∈ ι(M), if x ≥ y ∨ ik, then
since M is meet-closed, we have an element p of M that we call proxy such that
p = ⋀{s ∈ M / s ≥ y ∨ ik}. Yet, we have to make sure that (1) p can only be
obtained from y with exactly one particular ik if p ≠ y, and (2) that the sequence
of these particular join-irreducible elements forming the arrows of the path from
x to y are in the correct order. This is the purpose of the synchronizing condition
y ∨ ⋁ι(M)k ≥ p.
For (1), we will show that for a same proxy p, it holds that ∃!k ∈ [1, |ι(M)|]
such that p ≠ y, y ∨ ⋁ι(M)k ≥ p and y ≱ ik. Recall that we ordered the elements
ι(M) such that ∀ij, il ∈ ι(M), j < l ⇒ ij ≱ il. Let us note k the greatest
index among [1, |ι(M)|] such that p ≥ ik and y ≱ ik. It is easy to see that the
synchronizing condition is satisfied for ik. Then, for all j ∈ [1, k − 1], Corollary
1 and Lemma 1 give us that y ∨ ⋁ι(M)j ≱ ik, meaning that y ∨ ⋁ι(M)j ≱ p.
For all j ∈ [k + 1, |ι(M)|], either y ≥ ij (i.e. p = y ∨ ij = y) or p ≱ ij. Either
way, it is impossible to reach p from y ∨ ij. Therefore, there exists a unique path
from y to p that takes the arrow (p, y) from H^M_k.
Concerning (2), for all (x, y) ∈ G^M_≥, x ≠ y, let us note the proxy element
p1 = ⋀{s ∈ M / s ≥ y ∨ ik1} where k1 is the greatest index among [1, |ι(M)|] such
that p1 ≥ ik1 and y ≱ ik1. We have (p1, y) ∈ H^M_k1. Let us suppose that there exists
another proxy element p2 such that p2 ≠ p1, x ≥ p2 and p2 = ⋀{s ∈ M / s ≥
p1 ∨ ik2} where k2 is the greatest index among [1, |ι(M)|] such that p2 ≥ ik2 and
p1 ≱ ik2. We have (p2, p1) ∈ H^M_k2. Since p2 > p1 and p1 ≥ ik1, we have that
p2 ≥ ik1, i.e. k2 ≠ k1. So, two cases are possible: either k1 > k2 or k1 < k2.
If k1 > k2, then there is a path ((p2, p1), (p1, p1), . . . , (p1, p1), (p1, y)) from p2
to y. Moreover, we know that at step k1, we get p1 from y and that we have
p2 ≥ ik1 and y ≱ ik1, meaning that there could only exist an arrow (p2, y) in
H^M_k3 if k3 > k1 > k2. Suppose this k3 exists. Then, since k3 > k1 > k2, we
have that p2 ≥ ik3 and y ≱ ik3, but also p1 ≱ ik3 since we would have k1 = k3
otherwise. This implies that k2 = k3, which is impossible. Therefore, there is no
k3 such that (p2, y) ∈ H^M_k3, i.e. there is a unique path from p2 to y. Otherwise, if
k1 < k2, then the latter path between p2 and y does not exist. But, since p1 ≱ ik2
and p1 ≥ y, we have y ≱ ik2, meaning that there exists an arrow (p2, y) ∈ H^M_k2,
which forms a unique path from p2 to y. The recurrence of this reasoning enables
us to conclude that there is a unique path from x to y.
Thus, the condition of Sect. 2.1 is satisfied. H^M computes the zeta transfor-
mation of G^M_≥. Also, for the same reasons as with Theorem 1, we have that H^M
computes the Möbius transformation of G^M_≥. The proof for H^J and G^J_≤ is
analogous if P is a Boolean lattice.
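For intuition, here is the quadratic baseline over a meet-closed family that Theorem 2 accelerates (a sketch with our own naming, not the optimized sequence H^M itself): the zeta transform w.r.t. ⊇ restricted to M, and its inverse by decreasing cardinality.

```python
def zeta_on_M(M, m):
    """O(|M|^2) zeta transform w.r.t. the superset order, restricted to
    a meet-closed family M of frozensets (e.g. the commonality values
    on the focal points)."""
    return {y: sum(m.get(x, 0) for x in M if x >= y) for y in M}

def moebius_on_M(M, q):
    """Inverse on M, by decreasing cardinality:
    m(y) = q(y) - sum of m(x) over x in M strictly above y."""
    m = {}
    for y in sorted(M, key=len, reverse=True):
        m[y] = q[y] - sum(m[x] for x in M if x > y)
    return m
```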
4 Discussions for Dempster-Shafer Theory


In DST, we work with P = 2^Ω, in which the singletons are the join-irreducible
elements. If |supp(f)| is of the same order of magnitude as n or lower, where n = |Ω|,
then we can compute the focal points ∧supp(f) or ∨supp(f) and use our Effi-
cient Möbius Transformation of Theorem 2 to compute any DST transforma-
tion (e.g. the commonality/implicability function, the conjunctive/disjunctive
weight function, etc., i.e. wherever the FMT applies) in at most O(n·|supp(f)| +
|ι(supp(f))|·|R supp(f)|) operations, where R ∈ {∧, ∨}, which is at most O(n·2^n).
Otherwise, we can compute L,↑ supp(f) or L,↓ supp(f) of Proposition 2, and
then use the Efficient Möbius Transformation of Theorem 1 to compute the same
DST transformations in O(n·|supp(f)| + |ι(supp(f))|·|L,A supp(f)|) operations,
where A ∈ {↑, ↓}, which is at most O(n·2^n).
Therefore, if supp(f) is given, we can always compute DST transformations
more efficiently with the EMT than with the FMT.
Moreover, L,↓ supp(f ) can be optimized if Ω ∈ supp(f ) (which causes the
equality L,↓ supp(f ) = L supp(f )). Indeed, one can equivalently compute the
lattice L,↓ (supp(f )\{Ω}), execute the EMT of Theorem 1, and then add the
value on Ω to the value on all sets of L,↓ (supp(f )\{Ω}). Dually, the same can
be done with L,↑ (supp(f )\{∅}). This trick can be particularly useful in the case
of the conjunctive or disjunctive weight function, which requires that supp(f )
contains respectively Ω or ∅.
Also, optimizations built for the FMT, such as the reduction of Ω to the
core C or its optimal coarsened version Ω′, are already encoded in the use of
the function ι (see Example 1), but optimizations built for the evidence-based
approach, such as approximations by reduction of the number of focal sets, i.e.
reducing the size of supp(f), can still greatly enhance the EMT.
Finally, while it was proposed in [9] to fuse two mass functions m1 and
m2 using Dempster's rule by computing the corresponding commonality func-
tions q1 and q2 in O(n·2^n), then q12 = q1·q2 in O(2^n) and finally comput-
ing back the fused mass function m12 from q12 in O(n·2^n), here we propose
an even greater detour that has a lower complexity. Indeed, by computing q1
and q2 on ∧supp(m1) and ∧supp(m2), then the conjunctive weight functions
w1 and w2 on these same elements, we get w12 = w1·w2 in O(|∧supp(m1) ∪
∧supp(m2)|) (every other set has a weight equal to 1). Consequently, we obtain
the set supp(1 − w12) ⊆ ∧supp(m1) ∪ ∧supp(m2), which can be used to com-
pute ∧supp(1 − w12) or L,↓ supp(1 − w12). From this, we simply compute q12
and then m12 in O(n·|supp(1 − w12)| + |ι(supp(1 − w12))|·|∧supp(1 − w12)|) or
O(n·|supp(1 − w12)| + |ι(supp(1 − w12))|·|L,↓ supp(1 − w12)|).
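For reference, the evidence-based baseline this detour competes with is the direct unnormalized conjunctive rule, in O(|supp(m1)|·|supp(m2)|) intersections; a sketch with our own naming:

```python
from collections import defaultdict

def conjunctive_rule(m1, m2):
    """Unnormalized conjunctive combination (Dempster's rule without
    normalization): m12(A) = sum of m1(B) * m2(C) over B & C = A.
    Masses are dicts frozenset -> float restricted to their focal sets."""
    m12 = defaultdict(float)
    for b, vb in m1.items():
        for c, vc in m2.items():
            m12[b & c] += vb * vc
    return dict(m12)
```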

Example 1 (Consonant case). If supp(f) = {F1, F2, . . . , FK} such that F1 ⊂
F2 ⊂ · · · ⊂ FK, then the coarsening Ω′ of Ω will have an element for each
element of supp(f), while ι(supp(f)) will have a set of elements for each element
of supp(f). So, we get |Ω′| = |ι(supp(f))| = K. But, Ω′ is then used to generate
the Boolean lattice 2^Ω′, of size 2^K, where ι(supp(f)) is used to generate an
arbitrary lattice L supp(f), of size K in this particular case (K + 1 if ∅ ∈ supp(f)).
5 Conclusion
In this paper, we proposed the Efficient Möbius Transformations (EMT), which
are general procedures to compute the zeta and Möbius transforms of any func-
tion defined on any finite distributive lattice with optimal complexity. They are
based on our reformulation of the Möbius inversion theorem with focal points
only, featured in an upcoming detailed article [5] currently in preparation. The
EMT optimally exploit the information contained in both the support of this
function and the structure of distributive lattices. In doing so, the EMT always
perform better than the optimal complexity of an algorithm considering the
whole lattice, such as the FMT, for all DST transformations, given the support
of this function. In [5], we will see that our approach is still more efficient when
this support is not given. This forthcoming article will also feature examples of
application in DST, algorithms and implementation details.

References
1. Barnett, J.A.: Computational methods for a mathematical theory of evidence. In:
Proceedings of IJCAI, vol. 81, pp. 868–875 (1981)
2. Björklund, A., Husfeldt, T., Kaski, P., Koivisto, M.: Trimmed Moebius inversion
and graphs of bounded degree. Theory Comput. Syst. 47(3), 637–654 (2010)
3. Björklund, A., Husfeldt, T., Kaski, P., Koivisto, M., Nederlof, J., Parviainen, P.:
Fast zeta transforms for lattices with few irreducibles. ACM TALG 12(1), 4 (2016)
4. Chaveroche, M., Davoine, F., Cherfaoui, V.: Calcul exact de faible complexité
des décompositions conjonctive et disjonctive pour la fusion d’information. In:
Proceedings of GRETSI (2019)
5. Chaveroche, M., Davoine, F., Cherfaoui, V.: Efficient algorithms for Möbius trans-
formations and their applications to Dempster-Shafer Theory. Manuscript available
on request (2019)
6. Dempster, A.: A generalization of Bayesian inference. J. Roy. Stat. Soc. Ser. B
(Methodol.) 30, 205–232 (1968)
7. Gordon, J., Shortliffe, E.H.: A method for managing evidential reasoning in a
hierarchical hypothesis space. Artif. Intell. 26(3), 323–357 (1985)
8. Kaski, P., Kohonen, J., Westerbäck, T.: Fast Möbius inversion in semimodular
lattices and U-labelable posets. arXiv preprint arXiv:1603.03889 (2016)
9. Kennes, R.: Computational aspects of the Möbius transformation of graphs. IEEE
Trans. Syst. Man Cybern. 22(2), 201–223 (1992)
10. Sarabi-Jamab, A., Araabi, B.N.: Information-based evaluation of approximation
methods in Dempster-Shafer Theory. IJUFKS 24(04), 503–535 (2016)
11. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press,
Princeton (1976)
12. Shafer, G., Logan, R.: Implementing Dempster’s rule for hierarchical evidence.
Artif. Intell. 33(3), 271–298 (1987)
13. Shenoy, P.P., Shafer, G.: Propagating belief functions with local computations.
IEEE Expert 1(3), 43–52 (1986)
14. Wilson, N.: Algorithms for Dempster-Shafer Theory. In: Kohlas, J., Moral, S. (eds.)
Handbook of Defeasible Reasoning and Uncertainty Management Systems: Algo-
rithms for Uncertainty and Defeasible Reasoning, vol. 5, pp. 421–475. Springer,
Netherlands (2000). https://doi.org/10.1007/978-94-017-1737-3 10
Dealing with Continuous Variables
in Graphical Models

Christophe Gonzales(B)

Aix-Marseille Université, CNRS, LIS, Marseille, France
[email protected]
Abstract. Uncertain reasoning over both continuous and discrete ran-
dom variables is important for many applications in artificial intelligence.
Unfortunately, dealing with continuous variables is not an easy task. In
this tutorial, we will study some of the methods and models developed
in the literature for this purpose. We will start with the discretization of
continuous random variables. A special focus will be made on the numer-
ous issues they raise, ranging from which discretization criterion to use,
to the appropriate way of using them during structure learning. These
issues will justify the exploitation of hybrid models designed to encode
mixed probability distributions. Several such models have been proposed
in the literature. Among them, Conditional Linear Gaussian models are
very popular. They can be used very efficiently for inference but they
lack flexibility in the sense that they impose that the continuous ran-
dom variables follow conditional Normal distributions and are related to
other variables through linear relations. Other popular models are mix-
tures of truncated exponentials, mixtures of polynomials and mixtures of
truncated basis functions. Through a clever use of mixtures of distribu-
tions, these models can approximate very well arbitrary mixed probabil-
ity distributions. However, exact inference can be very time consuming
in these models. Therefore, when choosing which model to exploit, one
has to trade-off between the flexibility of the uncertainty model and the
computational complexity of its learning and inference mechanisms.

Keywords: Continuous variable · Hybrid graphical model ·


Discretization

Since their introduction in the 80’s, Bayesian networks (BN) have become one
of the most popular models for handling “precise” uncertainties [24]. However,
by their very definition, BNs can only cope with discrete random variables.
Unfortunately, in real-world applications, it is often the case that some
variables are of a continuous nature. Dealing with such variables is challenging
both for learning and inference tasks [9]. The goal of this tutorial is to investigate
techniques used to cope with such variables and, more importantly, to highlight
their pros and cons.

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 404–408, 2019.
https://doi.org/10.1007/978-3-030-35514-2_30
1 Mapping Continuous Variables into Discrete Ones
Probably, the simplest way to cope with continuous random variables in graphi-
cal models is to discretize them. Once the variables are discretized, these models
can be learnt and exploited as usual [4,19,29]. However, appropriately discretiz-
ing variables raises many issues. First, when learning the graphical model, should
all the variables be discretized independently or should the dependencies among
variables learnt so far be taken into account to jointly discretize some sets of vari-
ables? The second alternative provides better results and is therefore advocated
in the literature [7,18,21]. However, determining the best joint discretization is
a complex task and only approximations are provided. In addition, when dis-
cretizing while learning the graphical model structure, it is tempting to define
an overall function scoring both discretization and structure. Optimizing such a
function therefore provides both an optimal structure and a discretization most
suited for this structure. This is the approach followed in [7,21]. However, we
shall see that this may prove to be a bad idea because many of the structure
scoring functions represent posterior likelihoods and it is easy to construct dis-
cretizations resulting in infinite likelihoods whatever the structure. Choosing
the criterion to optimize to determine the best discretization is also a chal-
lenge. Depending on the kind of observations that will be used subsequently in
inferences, it may or may not be useful to consider uniform or non-uniform
density functions within discretization intervals. As discretizing variables results
in a loss of information, people often try to minimize this loss and therefore
exploit entropy-based criteria to drive their search for the optimal discretization.
While at first sight this seems a good idea, we will see that it may not be
appropriate for structure learning; other criteria, such as cluster-based
optimization [18] or Kullback-Leibler divergence minimization [10], are probably
much more appropriate. It should also be emphasized that inappropriate discretizations may
have a significant impact on the learnt structure because, e.g., dependent con-
tinuous random variables may become independent when discretized. This is the
very reason why it is proposed in [20] to compute independence tests at several
different discretization resolutions.
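As a toy illustration of why a single fixed resolution can be misleading (our own example, not taken from the tutorial): below, Y is a deterministic function of X, yet under a 2-bin equal-width discretization the empirical mutual information between the discretized variables is exactly zero, while 4 bins reveal the dependence.

```python
import math
from collections import Counter

def discretize(v, k):
    # Equal-width binning of a value in [0, 1) into k intervals.
    return min(int(v * k), k - 1)

def mutual_info(pairs):
    # Empirical mutual information (in nats) from a list of (x, y) symbols.
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

xs = [i / 1000 for i in range(1000)]
ys = [(2 * x) % 1.0 for x in xs]        # Y is a deterministic function of X

mi_2bins = mutual_info([(discretize(x, 2), discretize(y, 2)) for x, y in zip(xs, ys)])
mi_4bins = mutual_info([(discretize(x, 4), discretize(y, 4)) for x, y in zip(xs, ys)])
# mi_2bins is 0: the variables look independent at this resolution
```

This is exactly the failure mode that motivates testing independence at several resolutions, as in [20].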

2 Hybrid Graphical Models
As shown above, discretizations raise many issues. To avoid them, several models
have been introduced to directly cope with continuous variables. Unfortunately,
unlike in the discrete case, in the continuous case, there does not exist a universal
representation for conditional probabilities [9, chap. 14]. In addition, determining
conditional independencies among random variables is much more complicated
in general in the continuous case than in the discrete one [1]. Therefore, one
has to choose one such representation, and one actually has to trade off between
the flexibility of the uncertainty model and the computational complexity of its
learning and inference mechanisms.
Conditional Gaussian models and their mixing with discrete variables
[14,16,17] lie on one side of the spectrum. They compactly represent multivariate
Gaussian distributions (and their mixtures). In pure linear Gaussian models
(i.e., when there are no discrete variables), junction-tree based exact inference
mechanisms prove to be computationally very efficient (even more than in dis-
crete Bayesian networks) [15,28]. However, their main drawback is their lack
of flexibility: they can only model large multivariate Gaussian distributions. In
addition, the relationships between variables can only be linear. To deal with
more expressive mixed probability distributions, Conditional Linear Gaussian
models (CLG) allow discrete variables to be part of the graphical model, with
the constraint that the relations between the continuous random variables are
still limited to linear ones. By introducing latent discrete variables, this limi-
tation can be mitigated. This is a significant improvement, although CLGs are
still not very well suited to represent models in which random variables are not
distributed w.r.t. Normal distributions. Note that unlike inference in LG models,
which contain no discrete variables and in which inference can be performed both
exactly and efficiently, in CLGs part of the inference may require approximations
(the so-called weak marginalization) [15]. Structure learning can also
be performed efficiently in CLGs (at least when there are no latent variables)
[6,8,11,16].
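A CLG conditional density can be sketched in a few lines; the parameters below are hypothetical examples of ours, with one linear-Gaussian regime per discrete parent state.

```python
import math

def clg_density(x, z, d, params):
    # Conditional Linear Gaussian: X | Z=z, D=d ~ N(b0[d] + b1[d]*z, sigma[d]^2),
    # i.e. a Normal whose mean is linear in the continuous parent z, with one
    # regression (b0, b1, sigma) per discrete parent state d.
    b0, b1, sigma = params[d]
    mu = b0 + b1 * z
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical parameters: one linear-Gaussian regime per discrete state d.
params = {0: (0.0, 1.0, 1.0), 1: (5.0, -2.0, 0.5)}
p = clg_density(1.0, 1.0, 0, params)  # mean is 0 + 1*1 = 1, so x = 1 is the mode
```

The linearity of the mean in z is precisely the restriction discussed above; latent discrete states only switch between such linear regimes.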
To overcome the lack of flexibility of CLGs, other models have been intro-
duced that rely neither on Normal distributions nor on linear relations between
the variables. Among the most popular, let us cite mixtures of exponentials
(MTE) [3,22,27], mixtures of truncated basis functions (MoTBF) [12,13] and
mixtures of polynomials (MoP) [30,31,33]. As their names suggest, these three
models approximate mixed probability distributions by way of mixtures of spe-
cific types of probability density functions: in the case of MTEs and MoPs, those
are exponentials and polynomials respectively. MoTBFs are more general and
only require that the basis functions are closed under product and marginal-
ization. Mixture distributions have been well studied in the literature, notably
from the learning perspective [25]. However, unlike [25] in which the number
of components of the mixture is implicitly assumed to be small, the design of
MTEs, MoPs and MoTBFs allows them to compactly encode mixtures with
exponential numbers of components. The rationale behind all these models is
that, by cleverly exploiting mixtures, they can approximate very well (w.r.t. the
Kullback-Leibler distance) arbitrary mixed probability distributions [2,3]. MoPs
have several advantages over MTEs: their parameters for approximating density
functions are easier to determine than those of MTEs. They are also applicable
to a larger class of deterministic functions in hybrid Bayesian networks. These
models are generally easy to learn from datasets [23,26]. In addition, they satisfy
Shafer-Shenoy’s propagation axioms [32] and inference can thus be performed
using a junction tree-based algorithm [2,12,22]. However, in MTEs, MoPs and
MoTBFs, combinations and projections are algebraic operations over sums of
functions. As such, as the inference progresses, the number of terms involved in
these sums tends to grow exponentially, thereby limiting the use of this exact
inference mechanism to problems with only a small number of cliques. To overcome
this issue, approximate algorithms based on MCMC [22] or on the Penniless
algorithm [27] have been provided in the literature.
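What makes these mixture models convenient is that the basis functions integrate in closed form, so marginalisation remains exact. A one-piece polynomial density of our own (a real MoP would use several learned pieces) illustrates the exact integration:

```python
from fractions import Fraction

def poly_integral(coeffs, a, b):
    # Exact integral of sum_k coeffs[k] * x^k over [a, b] (rational arithmetic).
    return sum(Fraction(c) * (Fraction(b) ** (k + 1) - Fraction(a) ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))

# A single polynomial piece: f(x) = 3/4 * (1 - x^2) on [-1, 1].
pieces = [((-1, 1), [Fraction(3, 4), 0, Fraction(-3, 4)])]

mass = sum(poly_integral(c, a, b) for (a, b), c in pieces)              # total mass
mean = sum(poly_integral([0] + list(c), a, b) for (a, b), c in pieces)  # E[X]
```

Prepending a zero coefficient multiplies the polynomial by x, so the same routine also yields moments; this closed-form behaviour under product and marginalisation is what the MoTBF definition requires of its basis functions.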

3 Conclusion
Dealing with continuous random variables in probabilistic graphical models is
challenging. One has either to resort to discretization, which raises many issues
and may yield results far from the expected ones, or to exploit models
specifically designed to cope with continuous variables. But choosing the best model
is not easy, in the sense that one has to trade off between the flexibility of the
model and the complexity of its learning and inference. Clearly, there is still
room for improvements in such models, maybe by exploiting other features of
probabilities, such as copulas [5].

References
1. Bergsma, W.: Testing conditional independence for continuous random variables.
Technical report, 2004–049, EURANDOM (2004)
2. Cobb, B., Shenoy, P.: Inference in hybrid Bayesian networks with mixtures of
truncated exponentials. Int. J. Approximate Reasoning 41(3), 257–286 (2006)
3. Cobb, B., Shenoy, P., Rumí, R.: Approximating probability density functions in
hybrid Bayesian networks with mixtures of truncated exponentials. Stat. Comput.
16(3), 293–308 (2006)
4. Dechter, R.: Bucket elimination: a unifying framework for reasoning. Artif. Intell.
113, 41–85 (1999)
5. Elidan, G.: Copula Bayesian networks. In: Proceedings of NIPS 2010, pp. 559–567
(2010)
6. Elidan, G., Nachman, I., Friedman, N.: “Ideal parent” structure learning for con-
tinuous variable Bayesian networks. J. Mach. Learn. Res. 8, 1799–1833 (2007)
7. Friedman, N., Goldszmidt, M.: Discretizing continuous attributes while learning
Bayesian networks. In: Proceedings of ICML 1996, pp. 157–165 (1996)
8. Geiger, D., Heckerman, D.: Learning Gaussian networks. In: Proceedings of UAI
1994, pp. 235–243 (1994)
9. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Tech-
niques. MIT Press, Cambridge (2009)
10. Kozlov, A., Koller, D.: Nonuniform dynamic discretization in hybrid networks. In:
Proceedings of UAI 1997, pp. 314–325 (1997)
11. Kuipers, J., Moffa, G., Heckerman, D.: Addendum on the scoring of Gaussian
directed acyclic graphical models. Ann. Stat. 42(4), 1689–1691 (2014)
12. Langseth, H., Nielsen, T., Rumí, R., Salmerón, A.: Inference in hybrid Bayesian
networks with mixtures of truncated basis functions. In: Proceedings of PGM 2012,
pp. 171–178 (2012)
13. Langseth, H., Nielsen, T., Rumí, R., Salmerón, A.: Mixtures of truncated basis
functions. Int. J. Approximate Reasoning 53(2), 212–227 (2012)
14. Lauritzen, S.: Propagation of probabilities, means and variances in mixed graphical
association models. J. Am. Stat. Assoc. 87, 1098–1108 (1992)
15. Lauritzen, S., Jensen, F.: Stable local computation with mixed Gaussian distribu-
tions. Stat. Comput. 11(2), 191–203 (2001)
16. Lauritzen, S., Wermuth, N.: Graphical models for associations between variables,
some of which are qualitative and some quantitative. Ann. Stat. 17(1), 31–57
(1989)
17. Lerner, U., Segal, E., Koller, D.: Exact inference in networks with discrete children
of continuous parents. In: Proceedings of UAI 2001, pp. 319–328 (2001)
18. Mabrouk, A., Gonzales, C., Jabet-Chevalier, K., Chojnaki, E.: Multivariate cluster-
based discretization for Bayesian network structure learning. In: Beierle, C., Dekht-
yar, A. (eds.) SUM 2015. LNCS (LNAI), vol. 9310, pp. 155–169. Springer, Cham
(2015). https://doi.org/10.1007/978-3-319-23540-0 11
19. Madsen, A., Jensen, F.: LAZY propagation: a junction tree inference algorithm
based on lazy inference. Artif. Intell. 113(1–2), 203–245 (1999)
20. Margaritis, D., Thrun, S.: A Bayesian multiresolution independence test for con-
tinuous variables. In: Proceedings of UAI 2001, pp. 346–353 (2001)
21. Monti, S., Cooper, G.: A multivariate discretization method for learning Bayesian
networks from mixed data. In: Proceedings of UAI 1998, pp. 404–413 (1998)
22. Moral, S., Rumí, R., Salmerón, A.: Mixtures of truncated exponentials in hybrid
Bayesian networks. In: Benferhat, S., Besnard, P. (eds.) ECSQARU 2001. LNCS
(LNAI), vol. 2143, pp. 156–167. Springer, Heidelberg (2001). https://doi.org/10.
1007/3-540-44652-4 15
23. Moral, S., Rumí, R., Salmerón, A.: Estimating mixtures of truncated exponentials
from data. In: Proceedings of PGM 2002, pp. 135–143 (2002)
24. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, Burlington (1988)
25. Poland, W., Shachter, R.: Three approaches to probability model selection. In: de
Mantaras, R.L., Poole, D. (eds.) Proceedings of UAI 1994, pp. 478–483 (1994)
26. Romero, R., Rumí, R., Salmerón, A.: Structural learning of Bayesian networks
with mixtures of truncated exponentials. In: Proceedings of PGM 2004, pp. 177–
184 (2004)
27. Rumí, R., Salmerón, A.: Approximate probability propagation with mixtures of
truncated exponentials. Int. J. Approximate Reasoning 45(2), 191–210 (2007)
28. Salmerón, A., Rumí, R., Langseth, H., Madsen, A.L., Nielsen, T.D.: MPE infer-
ence in conditional linear Gaussian networks. In: Destercke, S., Denoeux, T. (eds.)
ECSQARU 2015. LNCS (LNAI), vol. 9161, pp. 407–416. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-20807-7 37
29. Shafer, G.: Probabilistic Expert Systems. Society for Industrial and Applied Math-
ematics (1996)
30. Shenoy, P.: A re-definition of mixtures of polynomials for inference in hybrid
Bayesian networks. In: Liu, W. (ed.) ECSQARU 2011. LNCS (LNAI), vol. 6717, pp.
98–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22152-1 9
31. Shenoy, P.: Two issues in using mixtures of polynomials for inference in hybrid
Bayesian networks. Int. J. Approximate Reasoning 53(5), 847–866 (2012)
32. Shenoy, P., Shafer, G.: Axioms for probability and belief function propagation. In:
Proceedings of UAI 1990, pp. 169–198 (1990)
33. Shenoy, P., West, J.: Inference in hybrid Bayesian networks using mixtures of
polynomials. Int. J. Approximate Reasoning 52(5), 641–657 (2011)
Towards Scalable and Robust
Sum-Product Networks

Alvaro H. C. Correia and Cassio P. de Campos(B)

Eindhoven University of Technology, Eindhoven, The Netherlands
[email protected]
Abstract. Sum-Product Networks (SPNs) and their credal counterparts
are machine learning models that combine good representational
power with tractable inference. Yet they often have thousands
of nodes which result in high processing times. We propose the addition
of caches to the SPN nodes and show how this memoisation technique
reduces inference times in a range of experiments. Moreover, we intro-
duce class-selective SPNs, an architecture that is suited for classification
tasks and enables efficient robustness computation in Credal SPNs. We
also illustrate how robustness estimates relate to reliability through the
accuracy of the model, and how one can explore robustness in ensemble
modelling.

Keywords: Sum-Product Networks · Robustness

1 Introduction
Sum-Product Networks (SPNs) [15] (conceptually similar to Arithmetic
Circuits [4]) are a class of deep probabilistic graphical models where exact
marginal inference is always tractable. More precisely, any marginal query can
be computed in time polynomial in the network size. Still, SPNs can capture
high tree-width models [15] and are capable of representing complex and highly
multidimensional distributions [5]. This promising combination of efficiency and
representational power has motivated several applications of SPNs to a variety
of machine learning tasks [1,3,11,16–18].
As any other standard probabilistic graphical model, SPNs learned from
data are prone to overfitting when evaluated at poorly represented regions of the
feature space, leading to overconfident and often unreliable conclusions. However,
due to the probabilistic semantics of SPNs, we can mitigate that issue through a
principled analysis of the reliability of each output. A notable example is Credal
SPNs (CSPNs) [9], an extension of SPNs to imprecise probabilities where we can
compute a measure of the robustness of each prediction. Such robustness values
are useful tools for decision-making, as they are highly correlated with accuracy,
and thus tell us when to trust the CSPN’s prediction: if the robustness of a
prediction is low, we can suspend judgement or even resort to another machine

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 409–422, 2019.
https://doi.org/10.1007/978-3-030-35514-2_31
learning model. Indeed, we show that robustness is also effective in ensemble
modelling, opening up new avenues for reliable machine learning.
Unfortunately, computing robustness requires many passes through the net-
work, which limits the scalability of CSPNs. We address that by introducing class-
selective (C)SPNs, a type of architecture that enables efficient robustness com-
putations due to their independent sub-networks: one for each label in the data.
Class-selective (C)SPNs not only enable fast robustness estimation but also out-
perform general (C)SPNs in classification tasks. In our experiments, their accuracy
was comparable to that of state-of-the-art methods, such as XGBoost [2].
We also study how to improve the scalability of (C)SPNs by reducing their
inference time. Although (C)SPNs ensure tractable inference, they are often
very large networks spanning thousands of nodes. In practice, that translates to
computational costs that might be too high for some large-scale applications. A
solution is to limit the network size, but that comes at the cost of the model’s
representational power. One way out of this trade-off is to notice that many
operations reoccur often in SPNs. As we descend from the root, the number of
variables in the scope of each node decreases. With a smaller feature space, these
nodes are likely to be evaluated at identical instantiations of their respective
variables, and we can avoid recomputing them by having a cache for previously
seen instances. We investigated the benefit of such a memoisation procedure across
25 UCI datasets [8] and observed that it reduces inference times considerably.
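The caching idea can be sketched with a minimal node class of our own (not the authors' implementation): each node memoises its value keyed by the instance restricted to its own scope, so any two instances that agree on that scope share a single computation.

```python
class Node:
    """Minimal SPN node with per-node memoisation of evaluations."""

    def __init__(self, kind, scope, children=(), weights=(), leaf_fn=None):
        self.kind = kind                  # 'sum', 'product' or 'leaf'
        self.scope = tuple(sorted(scope))
        self.children, self.weights, self.leaf_fn = children, weights, leaf_fn
        self.cache = {}

    def value(self, x):
        # x maps variable -> observed value; missing variables are marginalised.
        key = tuple(x.get(v) for v in self.scope)  # restriction to this scope
        if key not in self.cache:
            if self.kind == 'leaf':
                v = self.leaf_fn(x)
            elif self.kind == 'product':
                v = 1.0
                for c in self.children:
                    v *= c.value(x)
            else:  # sum node
                v = sum(w * c.value(x) for w, c in zip(self.weights, self.children))
            self.cache[key] = v
        return self.cache[key]

# Toy usage: product of two independent Bernoulli-like leaves (hypothetical).
def bern(var, p):
    return Node('leaf', {var},
                leaf_fn=lambda x: 1.0 if x.get(var) is None
                else (p if x[var] == 1 else 1.0 - p))

root = Node('product', {'a', 'b'}, children=(bern('a', 0.7), bern('b', 0.4)))
v = root.value({'a': 1, 'b': 1})   # 0.7 * 0.4; a repeated query hits the cache
```

Deeper nodes have smaller scopes and therefore fewer distinct keys, which is why the cache pays off most near the leaves.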
This paper is organised as follows. In Sect. 2, we give the necessary notation
and definitions of (credal) sum-product networks. In Sect. 3, we introduce class-
selective (C)SPNs and use them to derive a new algorithm for efficient robustness
computation. We detail memoisation techniques for (C)SPNs in Sect. 4 and show
their practical benefit in Sect. 5, where we report experiments on the effects of
memoisation and the performance of class-selective (C)SPNs. We also discuss
how robustness estimates translate into accuracy and how they can be exploited
in ensemble models. Finally, we conclude and point promising directions for
future work in Sect. 6.

2 (Credal) Sum-Product Networks

Before giving a formal definition of (C)SPNs, we introduce the necessary notation


and background. We write integers in lowercase letters (e.g. i, j, k) and sets
of integers in uppercase calligraphic letters (e.g. E, V). We denote by XV the
collection of random variables indexed by set V, that is, XV = {Xi : i ∈ V}.
We reserve V to represent the indices of all the variables over which a model
is defined, but when clear from the context, we omit the indexing set and use
simply X and x. Note that there is no ambiguity, since individual variables and
instantiations are always denoted with a subscript (e.g. Xi , xi ). The realisation
of a set of random variables is denoted in lowercase letters (e.g. XV = xV ). When
only a subset of the variables is concerned, we use a different indexing set E ⊆ V
to identify the corresponding variables XE and their realisations xE . Here xE is
what we call partial evidence, as not every variable is observed.
An SPN is a probabilistic graphical model defined over a set of random
variables XV by a weighted, rooted and acyclic directed graph where internal
nodes perform either sum or product operations, and leaves are associated with
indicator variables. Typically, an indicator variable is defined as the application
of a function λi,j such that

    λi,j(xE) = 0 if i ∈ E and xi ≠ j,   and   λi,j(xE) = 1 otherwise,

where xE is any partial or complete evidence. The SPN and its root node are
used interchangeably to mean the same object. We assume that every indicator
variable appears in at most one leaf node. Every arc from a sum node i to a
child j is associated with a non-negative weight wi,j such that Σj wi,j = 1 (this
constraint does not affect the generality of the model [14]).
Given an SPN S and a node i, we denote by S^i the SPN obtained by rooting
the network at i, that is, by discarding any non-descendant of i (other than i
itself). We call S^i the sub-network rooted at i, which is an SPN by itself (albeit
over a possibly different set of variables). If ω are the weights of an SPN S and
i is a node, we denote by ωi the weights in the sub-network S^i rooted at i, and
by wi the vector of weights wi,j associated with arcs from i to children j.
The scope of an SPN is the set of variables that appear in it. For an SPN
which is a leaf associated with an indicator variable, the scope is the respective
random variable. For an SPN which is not a leaf, the scope is the union of the
scopes of its children. We assume that scopes of children of a sum node are
identical (completeness) and scopes of children of a product node are disjoint
(decomposability) [12].
The value of an SPN S at a given instantiation xE, written S(xE), is
defined recursively in terms of its root node r. If r is a leaf node associated
with indicator variable λr,xr then S(xE) = λr,xr(xE). Else, if r is a product
node, then S(xE) = Πj S^j(xE), where j ranges over the children of r. Finally,
if r is a sum node then S(xE) = Σj wr,j S^j(xE), where again j ranges over
the children of r. Given these properties, it is easy to check that S induces
a probability distribution for XV such that S(xE) = P(xE) for any xE and
E ⊆ V. One can also compute expectations of functions over a variable Xi as
E(f | XE = xE) = Σxi f(xi) P(xi | XE = xE).
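The recursion above can be checked on a tiny hand-built network (a toy example of ours): with the indicator semantics, the same bottom-up pass answers both joint and marginal queries, and the marginal equals the sum of the corresponding joints.

```python
def indicator(i, j):
    # Leaf semantics from the text: 0 only when variable i is observed as a
    # value different from j; 1 otherwise (in particular when i is unobserved).
    return lambda xE: 0.0 if (i in xE and xE[i] != j) else 1.0

# S = 0.4 * [A=0][B=0] + 0.6 * [A=1][B=1]: a sum node over two product nodes.
leaf = {(i, j): indicator(i, j) for i in 'AB' for j in (0, 1)}

def S(xE):
    p0 = leaf[('A', 0)](xE) * leaf[('B', 0)](xE)
    p1 = leaf[('A', 1)](xE) * leaf[('B', 1)](xE)
    return 0.4 * p0 + 0.6 * p1

joint = S({'A': 1, 'B': 1})   # P(A=1, B=1) = 0.6
marg = S({'A': 1})            # P(A=1), obtained in a single pass
```

The marginal requires no explicit summation over B: leaving B unobserved makes both of its indicators return 1, which is what makes marginal inference linear in the network size.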
A Credal SPN (CSPN) is defined similarly, except for containing sets of
weight vectors in each sum node instead of a single weight vector. More precisely,
a CSPN C is defined by a set of SPNs C = {Sω : ω ∈ C} over the same graph
structure of S, where C is the Cartesian product of finitely-generated simplexes
Ci , one for each sum node i, such that the weights wi of a sum node i are con-
strained by Ci . While an SPN represents one joint distribution over its variables,
a CSPN represents a set of joint distributions. Therefore, one can use CSPNs to
obtain lower and upper bounds minω Eω (f |XE = xE ) and maxω Eω (f |XE = xE )
on the expected value of some function f of a variable, conditional on evidence
XE = xE . Recall that each choice of the weights ω of a CSPN {Sω : ω ∈ C}
defines an SPN and hence induces a probability measure Pω. We can therefore
compute bounds on the conditional expectations of a function over Xi:

    minω Eω(f | XE = xE) = minω Σxi f(xi) Pω(Xi = xi | XE = xE) .    (1)

The equations above are well-defined if minω Pω(XE = xE) > 0, which we
will assume to be true (statistical models often have some smoothing so that
zero probability is not attributed to any assignment of variables – this is our
assumption here, for simplicity, though this could be addressed in more sophis-
ticated terms). Note also that we can focus on the computation of the lower
expectation, as the upper expectation can be obtained from maxω Eω (f |xE ) =
− minω Eω (−f |xE ).
Computing the lower conditional expectation in Eq. (1) is equivalent to
finding whether:

    minω Eω(f | XE = xE) > 0 ⇐⇒ minω Σxi f(xi) Pω(Xi = xi, XE = xE) > 0 ,    (2)

as to obtain the exact value of the minimisation one can run a binary search for
μ until minω Eω ((f − μ)|xE ) = 0 (to the desired precision). The following result
will be used here. Corollary 1 is a small variation of the result in [9].
Theorem 1 (Theorem 1 in [9]). Consider a CSPN C = {Sω : ω ∈ C}.
Computing minω Sω(xE) and maxω Sω(xE) takes O(|C| · K) time, where |C| is
the number of nodes and arcs in C, and K is an upper bound on the cost of
solving a linear program of the form minwi Σj ci,j wi,j subject to wi ∈ Ci.

Corollary 1. Consider a CSPN C = {Sω : ω ∈ C} with a bounded number of


children per sum node specified by simplexes Ci of (finitely many) constraints of
the form li,j ≤ wi,j ≤ ui,j for given rationals li,j ≤ ui,j . Computing minω Sω (xE )
and maxω Sω (xE ) can be solved in time O(|S|).

Proof. When local simplexes Ci have constraints li,j ≤ wi,j ≤ ui,j, then the
local optimisations S^i(xE) = minwi Σj wi,j S^j(xE) are equivalent to fractional
knapsack problems [7], which can be solved in constant time for nodes with
bounded number of children. Thus, the overall running time is O(|S|).
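The local problems in the proof have the fractional-knapsack form min Σj wi,j S^j(xE) over box-constrained weights summing to one, solved greedily by pushing as much weight as possible onto the cheapest children. A sketch of ours, applied to an ε-contamination (ε = 0.2) of the weight vector (0.5, 0.5):

```python
def min_weighted_sum(values, lower, upper):
    # Minimise sum_j w_j * values[j]  s.t.  sum_j w_j = 1, lower[j] <= w_j <= upper[j].
    # Greedy (fractional-knapsack style): start at the lower bounds, then pour
    # the remaining mass onto children in increasing order of their value.
    w = list(lower)
    remaining = 1.0 - sum(lower)
    for j in sorted(range(len(values)), key=lambda j: values[j]):
        take = min(upper[j] - lower[j], remaining)
        w[j] += take
        remaining -= take
        if remaining <= 0:
            break
    return sum(wj * v for wj, v in zip(w, values)), w

# eps-contamination of (0.5, 0.5) with eps = 0.2: each weight lies in [0.4, 0.6].
val, w = min_weighted_sum([0.2, 0.8], [0.4, 0.4], [0.6, 0.6])
```

The maximisation maxω is obtained the same way by negating the children values.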

3 Efficient Robustness Measure Computation


We now define an architecture called class-selective (C)SPN that is provably
more efficient in computing robustness values. Figure 1 illustrates its structure.

Definition 1. Consider a domain where variable Xc is called the class variable.
A class-selective (C)SPN has a sum node as root node S with |Xc| product nodes
as its children, each of which has an indicator leaf node for a different value xc
of Xc (besides potentially other sibling (C)SPNs). These product nodes that are
children of S have disjoint sets of internal descendant nodes.
The name class-selective was inspired by selective SPNs [13], where only one
child of each sum node is active at a time. In a class-selective SPN such property
is restricted to the root node: for a given class value, only one of the sub-networks
remains active. That is made clear in Fig. 1 as only one of the indicator nodes
Cn is non-zero and all but one of children of the root node evaluate to zero.

[Figure 1: network diagram omitted]

Fig. 1. Illustration of a class-selective SPN. In the graph, Cn is a leaf node applying
the indicator function λc,n(xE).

In a class-selective CSPN, the computation of the expectation of a function of
the class variable can be achieved as efficiently as in standard SPNs:

    minω Σxc f(xc) Pω(Xc = xc, XE = xE)
        = minwr [ Σxc: f(xc)≥0 f(xc) wr,xc minωxc S^xc_ωxc(xc, xE)
                + Σxc: f(xc)<0 f(xc) wr,xc maxωxc S^xc_ωxc(xc, xE) ] ,

where r is the root node index with children S^xc for each value xc. Note that each
of these internal optimisations can be obtained by independent executions which
take altogether time O(|S|) by Corollary 1 (as each execution runs over non-
overlapping sub-CSPNs corresponding to different class labels xc ). Moreover,
note that in a non-credal class-selective SPN, finding the class label of maximum
probability (and its probability) takes time O(|S|) in the worst case. That is more
efficient than general SPNs, where |S| · |Xc | nodes may need to be visited.
Let us turn our attention to the CSPN robustness estimation in a classification
problem. Given input instance XE = xE for which we want to predict the
class variable value, we say that the classification using a CSPN C is robust if
the class value x*c = arg maxxc P(xc | xE) predicted by an SPN Sω that belongs to
C = {Sω : ω ∈ C} is also the prediction of any other Sω′ ∈ C (hence it is unique
for C), which happens if and only if

    minω Eω(Ix*c − Ix′c | XE = xE) > 0 for every x′c ≠ x*c .    (3)

In the case of class-selective CSPNs, this task equates to checking whether

    minω Pω(Xc = x*c, XE = xE) > maxx′c≠x*c maxω Pω(Xc = x′c, XE = xE) ,
Algorithm 1. Efficient ε-robustness computation.

Function Robustness(S, xE, x*c, er):
    Data: Class-selective SPN S, Input xE, Prediction x*c given XE = xE,
          Precision er < 1
    Result: Robustness ε
    εmax ← 1;
    εmin ← 0;
    while εmin < εmax − er do
        ε ← (εmin + εmax)/2;
        v ← minω Sω(x*c, xE);
        for x′c ≠ x*c do
            v′ ← maxω Sω(x′c, xE);
            if v′ ≥ v then
                εmax ← ε;
                break
            end
        end
        if εmax > ε then
            εmin ← ε
        end
    end
    return ε;

that is, regardless of the choice of weights ω ∈ C, we would have Pω(xc | xE) >
Pω(x'c | xE) for all other labels x'c.
General CSPNs may require 2 · |S| · (|Xc| − 1) node evaluations in the worst case to identify whether a predicted class label xc is robust for instance XE = xE, while a class-selective CSPN obtains the same result in |S| node evaluations. This is because general CSPNs must run over their nodes (|Xc| − 1) times in order to reach a conclusion about Expression (3), while the class-selective CSPN can compute min_ω Sω(xc, xE) and max_ω Sω(x'c, xE) (the max is computed for each x'c) in a single pass (taking overall |S| node evaluations, since these computations run over non-overlapping sub-networks for different class values).
Finally, given an input instance XE = xE and an SPN S learned from data, we can compute a robustness measure as follows. We define a collection of CSPNs C_{S,ε}, parametrised by 0 ≤ ε < 1, such that each weight vector wi of a node i in C_{S,ε} is allowed to vary within an ε-contaminated credal set of the original weight vector wi of the same node i in S. A robustness measure for the issued prediction C_{S,ε}(xE) is then defined by the largest ε such that C_{S,ε} is robust for xE. Finding such ε can be done using a simple binary search, as shown in Algorithm 1.
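The binary search of Algorithm 1 can be sketched as follows. Here `min_score` and `max_score` are hypothetical stand-ins for the lower bound min_ω Sω(xc, xE) and upper bound max_ω Sω(x'c, xE) that the CSPN would compute for a given contamination ε, and the toy bounds at the bottom are purely illustrative:

```python
def robustness(min_score, max_score, labels, pred, x, er=0.004):
    """Binary search for the largest eps keeping prediction 'pred' robust.

    min_score(eps, label, x): lower bound of P(label, x) over the
    eps-contaminated credal set; max_score(eps, label, x): upper bound.
    Both are assumed to be supplied by the CSPN (hypothetical API).
    """
    lo, hi = 0.0, 1.0
    while lo < hi - er:
        eps = (lo + hi) / 2
        v = min_score(eps, pred, x)
        # Robust iff no competing label can overtake the prediction.
        robust = all(max_score(eps, c, x) < v for c in labels if c != pred)
        if robust:
            lo = eps
        else:
            hi = eps
    return lo

# Toy bounds: class 0 has baseline joint 0.6, class 1 has 0.4, and the
# contamination widens each bound linearly by eps (illustrative only).
mn = lambda eps, c, x: (0.6 if c == 0 else 0.4) - eps
mx = lambda eps, c, x: (0.6 if c == 0 else 0.4) + eps
eps = robustness(mn, mx, [0, 1], 0, None)
# The prediction flips once 0.4 + eps >= 0.6 - eps, i.e. eps >= 0.1,
# so the search converges to just below 0.1 within precision er.
```

The search performs O(log(1/er)) passes; with er = 0.004 that is the 8 passes referred to in the caption of Table 2.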

4 Memoisation
When evaluating an SPN on a number of different instances, some of the com-
putations are likely to reoccur, especially in nodes with scopes of only a few
Towards Scalable and Robust Sum-Product Networks 415

variables. One simple and yet effective solution to reduce computing time is to
cache the results at each node. Thus, when evaluating the network recursively,
we do not have to visit any of the children of a node if a previously evaluated data
point had the same instantiation over the scope of that node. To be more precise,
consider a node S with scope S, and two instances x, x such that xS = xS . It is
clear that S(x) = S(x ), so once we have cached the value of one, there is no need
to reevaluate node S, or any of its children, on the other. For a CSPN C, the
same approach holds, but we need different caches for maximisation and min-
imisation, as well as for different SPNs Sω that belong to C (after all, a change
of ω may imply a change of the result). Notice that the computational overhead
of memoisation is amortised constant, as it amounts to accessing a hash table.
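A minimal sketch of this node-level caching, with the cache key being the instantiation of the node's scope (class names and structure are illustrative, not the authors' implementation):

```python
# Node-level memoisation sketch: each node caches its value keyed by the
# instantiation of its scope, so its children are revisited only on a miss.
calls = {"leaf": 0}

class Leaf:
    def __init__(self, var, probs):
        self.var, self.probs = var, probs
    def value(self, x):
        calls["leaf"] += 1            # count child evaluations
        return self.probs[x[self.var]]

class SumNode:
    def __init__(self, weights, children, scope):
        self.weights, self.children = weights, children
        self.scope = tuple(scope)     # variables this node depends on
        self.cache = {}               # scope instantiation -> cached value
    def value(self, x):
        key = tuple(x[v] for v in self.scope)
        if key not in self.cache:     # children visited only on a cache miss
            self.cache[key] = sum(w * c.value(x)
                                  for w, c in zip(self.weights, self.children))
        return self.cache[key]

s = SumNode([0.3, 0.7],
            [Leaf("a", {0: 0.9, 1: 0.1}), Leaf("a", {0: 0.4, 1: 0.6})],
            scope=["a"])
v1 = s.value({"a": 1, "b": 0})
v2 = s.value({"a": 1, "b": 1})        # same scope instantiation: cache hit
print(v1 == v2, calls["leaf"])        # True 2: leaves evaluated only once
```

The second instance differs on variable b, which is outside the node's scope, so the cached value is reused and the leaves are not touched again, matching the amortised-constant overhead of a hash-table lookup.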
Mei et al. proposed computing maximum a posteriori (MAP) by storing the
values of nodes at a given evidence and searching for the most likely query in a
reduced SPN, where nodes associated with the evidence are pruned [10]. Memoi-
sation can be seen as eliminating such nodes implicitly (as their values are stored
and they are not revisited) but goes further by using values calculated at other
instances to save computational power. In fact, the application of memoisation
to MAP inference is a promising avenue for future research, since many methods
(e.g. hill climbing) evaluate the SPN at small variations of the same input that
are likely to share partial instantiations in many of the nodes in the network.

5 Experiments
We investigated the performance of memoisation and class-selective (C)SPNs
through a series of experiments over a range of 25 UCI datasets [8]. All experi-
ments were run on a single modern core with our implementation of Credal SPNs,
which runs LearnSPN [6] for structure learning. Source code is available on our
pages and/or upon request.
Table 1 presents the UCI data sets on which we ran experiments described
by their number of independent instances N , number of variables |X| (including
a class variable Xc ) and number of class labels |Xc |. All data sets are categor-
ical (or have been made categorical using discretisation by median value). We
also show the 0–1 classification accuracy obtained by both General and Class-
selective SPNs as well as the XGBoost library that provides a parallel tree gra-
dient boosting method [2], considered a state-of-the-art technique for supervised
classification tasks. Results are obtained by stratified 5-fold cross-validation and,
as the table shows, class-selective SPNs largely outperform general SPNs while
being comparable to XGBoost in terms of classification accuracy.
We leave further comparisons between general and class-selective networks
to the appendix, where Table 3 depicts the two types of network in terms of their
architecture and processing times on classification tasks. There one can see that
class-selective SPNs have a higher number of parameters due to a larger number
of sum nodes. However, in some cases general SPNs are deeper, which means
class-selective SPNs tend to grow sideways, especially when the number of classes
is high. Nonetheless, the larger number of parameters in class-selective networks

Table 1. Percent accuracy of XGBoost, General SPNs and Class-selective SPNs across
several UCI datasets. All experiments consisted of stratified 5-fold cross-validation.

Dataset N |X| |Xc | XGBoost General SPN Class-selective SPN


zoo 101 17 7 93.069 76.238 96.04
bridges 107 11 6 66.355 57.009 63.551
lymph 148 18 4 81.757 72.973 81.081
flags 194 29 8 66.495 43.299 57.732
autos 205 26 2 91.22 88.293 89.268
breast cancer 286 10 2 72.378 71.678 67.832
heart h 294 12 2 79.592 79.932 81.633
ecoli 336 6 8 72.321 63.393 74.405
liver disorders 345 7 2 65.797 57.391 64.638
dermatology 366 35 6 97.268 81.694 98.907
colic 368 23 2 83.696 77.717 78.533
balance scale 625 5 3 72 72.48 72.16
soybean 683 36 19 93.704 62.518 94.583
diabetes 768 9 2 70.313 70.703 69.922
vehicle 846 19 4 65.485 46.454 60.402
tic tac toe 958 10 2 84.969 69.937 73.382
vowel 990 14 11 64.747 33.737 59.394
solar flare 2 1,066 12 6 73.64 59.475 73.077
cmc 1,473 10 3 51.799 48.133 48.065
car 1,728 7 4 87.905 70.023 93.287
segment 2,310 17 7 82.771 67.662 80.823
sick 3,772 28 2 93.902 93.876 91.463
hypothyroid 3,772 28 4 92.285 92.285 91.569
spambase 4,601 8 2 78.918 78.505 76.766
nursery 12,960 9 5 94.961 81.505 92.299

does not translate into higher latency, as both architectures have similar learning
and inference times. We attribute that to the independence of the sub-networks
of each class, which facilitates inference. Notice that the two architectures are
equally efficient only in the classification task (the only aspect compared in Table 3)
and not in robustness computations; we showed above that the latter are
more efficient in class-selective networks when using Algorithm 1.
In Table 2, we report the average inference time per instance for 25 UCI
datasets. When using memoisation, the inference time dropped by at least 50% in
all datasets we evaluated, showing that memoisation is a valuable tool to render
(C)SPNs more efficient. We can also infer from the experiments that the relative
reduction in computing time tends to increase with the number of instances
N. For datasets with more than 2,000 instances, which are still relatively small,
adding memoisation already cut the inference time by more than 90%. That is a
promising result for large-scale applications, as memoisation grows more effective
with the number of data points. We can better observe the effect of memoisation by
plotting a histogram of the inference times as in Fig. 2, which covers 6 of the
UCI datasets of Table 2. We can see that memoisation concentrates the distribution
at lower time values, showing that most instances take advantage of the
cached results in a number of nodes in the network.

Table 2. Average inference time and number of nodes visited per inference for a CSPN
with (+M) and without (−M) memoisation across 25 UCI datasets. The respective
ratios (%) are the saved time or node visits, that is, 1 − (+M/−M). Robustness was
computed with a precision of 0.004, which requires 8 passes through the network as
per Algorithm 1.

Inference Time (s) # Nodes Evaluated


Dataset N |X| |Xc | –M +M % –M +M %
zoo 101 17 7 1.742 0.754 56.7 15,020 2,113 85.93
bridges 107 11 6 0.693 0.335 51.66 8,791 1,665 81.06
lymph 148 18 4 1.28 0.535 58.17 11,557 1,990 82.78
flags 194 29 8 5.44 1.641 69.84 55,670 5,973 89.27
autos 205 26 2 1.303 0.541 58.5 17,100 3,534 79.33
breast cancer 286 10 2 0.422 0.13 69.3 5,294 891 83.17
heart h 294 12 2 0.279 0.101 63.94 2,501 330 86.79
ecoli 336 6 8 0.891 0.164 81.62 8,703 663 92.38
liver disorders 345 7 2 0.172 4.936e−2 71.32 1,444 153 89.39
dermatology 366 35 6 5.747 1.276 77.8 57,239 3,623 93.67
colic 368 23 2 1.281 0.412 67.85 13,435 1,609 88.03
balance scale 625 5 3 0.15 2.264e−2 84.94 1,720 71 95.86
soybean 683 36 19 24.125 5.141 78.69 2.803e+5 5,831 97.92
diabetes 768 9 2 0.329 6.424e−2 80.49 3,028 201 93.37
vehicle 846 19 4 1.864 0.299 83.94 19,408 797 95.89
tic tac toe 958 10 2 0.679 0.132 80.56 7,693 519 93.25
vowel 990 14 11 12.126 1.791 85.23 1.252e+5 5,804 95.37
solar flare 2 1,066 12 6 3.467 0.364 89.51 22,320 707 96.83
cmc 1,473 10 3 1.43 0.209 85.36 17,688 775 95.62
car 1,728 7 4 0.769 0.124 83.92 9,720 412 95.76
segment 2,310 17 7 6.56 0.421 93.59 42,923 579 98.65
sick 3,772 28 2 1.844 0.15 91.89 11,412 146 98.72
hypothyroid 3,772 28 4 2.322 0.252 89.17 20,864 223 98.93
spambase 4,601 8 2 0.583 2.236e−2 96.16 6,430 75 98.83
nursery 12,960 9 5 8.27 0.661 92.01 74,418 1,088 98.54

If we consider the number of nodes visited instead of time, the results are
even more telling. In Table 2, on average, less than 15% of node evaluations were
necessary during inference with memoisation. That is a more drastic reduction
than what we observed when comparing processing times, which means there is
still plenty of room for improvement in computational time with better data
structures or faster programming languages.
It is worth noting that memoisation is only effective over discrete variables.
When a subset of the variables is continuous, the memoisation will only be
effective in nodes whose scope contains only discrete variables, which is likely to
reduce the computational gains of memoisation. An alternative is to construct
the cache with ranges of values for the continuous variables. To be sure, that
is a form of discretisation that might worsen the performance of the model,
but it occurs only at inference time and can be easily switched off when high
precision is needed. A thorough study of the effect of memoisation on models
with continuous variables is left for future work.

Fig. 2. Empirical distribution of CSPN inference times with and without memoisation.

5.1 Exploring Data on the Robustness Measure


We can interpret robustness as a measure of the model's confidence in its output.
Roughly speaking, in a classification task, the robustness value ε of a prediction
corresponds to how much we can tweak the network's parameters without changing
the final result, that is, the class of maximum probability. Thus, a large ε
means that many similar networks (in the parameter space) would give the same
output for the instance in question. Similarly, we can think that small changes
in the hyperparameters or the data sample would not produce a model whose
prediction would be different for that given instance. Conversely, a small ε tells us
that slightly different networks would already provide us with a distinct answer.
In that case, the prediction is not reliable, as it might fluctuate with any variation
in the learning or data acquisition processes.
We can validate this interpretation by investigating how robustness relates
to the accuracy of the model. In Fig. 3(a), we defined a number of robustness
thresholds and, for each of them, we computed the accuracy of the model over
instances for which ε was above the threshold. It is clear from the graph that
the accuracy increases with the threshold, and we can infer that robustness
does translate into reliability, as the model is more accurate over instances with
high ε values. We arrive at a similar conclusion in Fig. 3(b), where we consider
instances for which ε was below a given threshold. In this case, the curves start
at considerably lower accuracy values, where only examples with low robustness
are considered, and then build up as instances with higher ε values are added.
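The computation behind such curves can be sketched as follows (the data below is invented for illustration; the 50-instance cut-off mirrors the one used for the curves in Fig. 3):

```python
def accuracy_above(eps_vals, correct, t, min_count=50):
    """Accuracy restricted to predictions with robustness eps >= t.

    Returns None when fewer than min_count instances qualify, mirroring
    the cut-off used for the curves in Fig. 3.
    """
    kept = [c for e, c in zip(eps_vals, correct) if e >= t]
    if len(kept) < min_count:
        return None
    return sum(kept) / len(kept)

# Illustrative data: 100 instances; the robust half is more accurate.
eps_vals = [0.02] * 50 + [0.2] * 50
correct  = [1] * 30 + [0] * 20 + [1] * 45 + [0] * 5
print(accuracy_above(eps_vals, correct, 0.0))   # 0.75 over all instances
print(accuracy_above(eps_vals, correct, 0.1))   # 0.9 over robust instances
```

Sweeping t over a grid and plotting the result reproduces the shape of the curves in Fig. 3(a); replacing `e >= t` with `e < t` gives the below-threshold variant of Fig. 3(b).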
Robustness values are not only a guide for reliable decision-making but are
also applicable to ensemble modelling. The idea is to combine a CSPN C =
{Sω : ω ∈ C} with another model f by defining a robustness threshold t. We
rely on the CSPN’s prediction for all instances for which its robustness is above

Fig. 3. Accuracy of predictions with robustness (a) above and (b) below different
thresholds for 12 UCI datasets. Some curves end abruptly because we only computed
the accuracy when 50 or more data points were available for a given threshold.

Fig. 4. Performance of the ensemble model against different robustness thresholds for
12 UCI datasets. (a) Accuracy of the ensemble model; (b) accuracy of the ensemble
model over accuracy of the CSPN and (c) over accuracy of the XGBoost.

the threshold, and on f for the remaining ones. To be precise, we can define an
ensemble E_{C,f} as

\[
E_{C,f}(x_E, t) = \begin{cases} x^*_c & \text{if } \varepsilon \ge t, \\ f(x_E) & \text{otherwise,} \end{cases}
\]

where x*c = arg max_{xc} S(xc, xE) is the class predicted by a class-selective SPN S
learned from data (from which the ε-contaminated CSPN C is built), and ε is the
corresponding robustness value, ε = Robustness(S, xE, x*c); see Algorithm 1.
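The ensemble rule itself is a one-liner; in this sketch the CSPN prediction and its robustness ε are assumed to come from the class-selective SPN and Algorithm 1, and the fallback stands in for the XGBoost model (all names are illustrative):

```python
def ensemble_predict(cspn_label, eps, fallback, x, t):
    """E_{C,f}: trust the CSPN when its robustness eps >= t, else fall back.

    cspn_label and eps would be produced by the class-selective SPN and
    Algorithm 1; 'fallback' is any other classifier (hypothetical API).
    """
    return cspn_label if eps >= t else fallback(x)

fallback = lambda x: "b"
print(ensemble_predict("a", 0.20, fallback, None, t=0.05))  # 'a' (robust)
print(ensemble_predict("a", 0.01, fallback, None, t=0.05))  # 'b' (fallback)
```

Since t is fixed in advance, the CSPN only needs the two bound evaluations at the chosen t rather than the full binary search, which is why the robustness check comes "for free" in the ensemble.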
We implemented such an ensemble model by combining the ε-contaminated
CSPN with an XGBoost model. We computed the accuracy of the ensemble for
different thresholds t over a range of UCI data sets, as reported in Fig. 4. In plot
(a), we see that the accuracy varies considerably with the threshold t, which suggests
there is an optimum value for t. In the other two plots, we compare the ensemble
against the CSPN (b); and the XGBoost model (c). We computed the ratio of
the accuracy of the ensemble over the accuracy of the competing model, so that
any point above one indicates an improvement in performance. For many datasets, the
ensemble delivered better results, and in some cases it was superior to both original
models for an appropriate robustness threshold. In spite of that, we have not
investigated how to find good thresholds, which we leave for future work. Yet,
we point out that the complexity of queries using the class-selective CSPN in
the ensemble will be the same as that of class-selective SPNs (the robustness
comes “for free”), since the binary search for the computation of the threshold
will not be necessary (we can run it for the pre-decided t only).

6 Conclusion
SPNs and their credal version are highly expressive deep architectures with
tractable inference, which makes them a promising option for large-scale appli-
cations. However, they still lag behind some other machine learning models in
terms of architectural optimisations and fast implementations. We address that
through the introduction of memoisation, a simple yet effective technique that
caches previously computed results in a node. In our experiments, memoisation
reduced the number of node evaluations by more than 85% and the inference
time by at least 50% (often much more). We believe this is a valuable new tool
to help bring (C)SPNs to large-scale applications where low latency is essential.
We also discussed a new architecture, class-selective (C)SPNs, that combine
efficient robustness computation with high accuracy on classification tasks, out-
performing general (C)SPNs. Even though they excel in discriminative tasks,
class-selective SPNs are still generative models fully endowed with the seman-
tics of graphical models. We demonstrated how their probabilistic semantics can
be brought to bear through their extension to Credal SPNs. Namely, we explored
how robustness values relate to the accuracy of the model and how one can use
them to develop ensemble models guided through principled decision-making.
We finally point out some interesting directions for future work. As demon-
strated here, class-selective (C)SPNs have proven to be powerful models in clas-
sification tasks, but they arbitrarily place the class variable in a privileged posi-
tion in the network. Future research might investigate how well class-selective
(C)SPNs fit the joint distribution and how they would fare in predicting other
variables. Memoisation also opens up new promising research avenues, notably
in how it performs on other inference tasks such as MAP and how it can be
extended to accommodate continuous variables.
A Appendix

Learning (s) Inference (s) # Nodes Height # Parameters
Dataset N |X| |Xc| Gen CS Gen CS Gen CS Gen CS Gen CS
zoo 101 17 7 0.35 0.435 2.744e−2 4.522e−2 250 419 8 5 74 125


bridges 107 11 6 0.228 0.358 1.529e−2 2.74e−2 154 289 7 5 39 71
lymph 148 18 4 0.605 0.598 2.209e−2 2.686e−2 362 446 8 8 91 115
flags 194 29 8 1.582 2.311 0.115 0.197 1,013 1,744 11 7 188 328
autos 205 26 2 1.644 1.652 2.953e−2 3.02e−2 958 971 12 11 211 221
breast cancer 286 10 2 0.382 0.418 7.835e−3 9.793e−3 253 304 9 9 51 59
heart h 294 12 2 0.22 0.21 4.193e−3 4.307e−3 131 138 6 6 37 39
ecoli 336 6 8 0.121 0.242 1.223e−2 2.789e−2 101 233 5 5 25 59
liver disorders 345 7 2 0.107 0.108 2.467e−3 2.5e−3 77 78 6 6 24 25
dermatology 366 35 6 2.971 2.802 0.161 0.171 1,834 1,952 15 10 383 408
colic 368 23 2 1.084 1.326 1.885e−2 2.405e−2 625 791 11 12 149 183
balance scale 625 5 3 0.11 0.112 4.258e−3 4.068e−3 85 82 7 6 26 24
soybean 683 36 19 4.308 4.969 0.763 1.125 2,604 3,913 16 9 596 940
diabetes 768 9 2 0.263 0.252 5.672e−3 5.564e−3 176 172 9 8 58 56
vehicle 846 19 4 1.075 1.482 3.717e−2 5.226e−2 585 830 12 11 186 272
tic tac toe 958 10 2 0.725 0.66 1.623e−2 1.568e−2 496 470 11 11 128 116
vowel 990 14 11 3.498 3.879 0.417 0.49 2,444 2,800 17 13 502 663
solar flare 2 1,066 12 6 1.03 0.827 6.738e−2 6.088e−2 784 708 12 10 158 148
cmc 1,473 10 3 1.156 1.136 3.775e−2 3.685e−2 828 812 15 14 198 193
car 1,728 7 4 0.523 0.556 2.337e−2 2.794e−2 370 434 11 10 81 92
segment 2,310 17 7 1.558 2.028 9.55e−2 0.142 888 1,301 13 11 263 419
sick 3,772 28 2 2.665 2.229 2.493e−2 1.926e−2 802 616 13 11 249 196
hypothyroid 3,772 28 4 2.577 2.541 4.509e−2 5.393e−2 728 869 12 13 224 274
spambase 4,601 8 2 0.649 0.599 1.205e−2 1.247e−2 353 364 10 12 115 119
nursery 12,960 9 5 4.967 4.363 0.258 0.226 3,437 3,025 19 18 755 669
Table 3. Comparison between General (Gen) and Class-Selective (CS) SPNs in learning and average inference times (s), number of nodes, height and number of parameters.

In Table 3 we compare general and class-selective SPNs in terms of their architecture and processing times in classification tasks (no robustness computation).



References
1. Amer, M.R., Todorovic, S.: Sum product networks for activity recognition. IEEE
Trans. Pattern Anal. Mach. Intell. 38(4), 800–813 (2016)
2. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 785–794 (2016)
3. Cheng, W.C., Kok, S., Pham, H.V., Chieu, H.L., Chai, K.M.A.: Language model-
ing with sum-product networks. In: Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH, pp. 2098–2102
(2014)
4. Darwiche, A.: A differential approach to inference in Bayesian networks. J. ACM
50(3), 280–305 (2003)
5. Delalleau, O., Bengio, Y.: Shallow vs. deep sum-product networks. In: Advances in
Neural Information Processing Systems, vol. 24. pp. 666–674. Curran Associates,
Inc. (2011)
6. Gens, R., Domingos, P.: Learning the structure of sum-product networks. In: Pro-
ceedings of the 30th International Conference on Machine Learning, vol. 28, pp.
229–264 (2013)
7. Korte, B., Vygen, J.: Combinatorial Optimization: Theory and Algorithms, 5th
edn. Springer Publishing Company, Incorporated, Heidelberg (2012)
8. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/
ml
9. Mauá, D.D., Conaty, D., Cozman, F.G., Poppenhaeger, K., de Campos, C.P.:
Robustifying sum-product networks. Int. J. Approximate Reasoning 101, 163–180
(2018)
10. Mei, J., Jiang, Y., Tu, K.: Maximum a posteriori inference in sum-product net-
works. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
11. Nath, A., Domingos, P.: Learning tractable probabilistic models for fault localiza-
tion. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 1294–
1301 (2016)
12. Peharz, R.: Foundations of Sum-Product Networks for Probabilistic Modeling.
Ph.D. thesis, Graz University of Technology (2015)
13. Peharz, R., Gens, R., Domingos, P.: Learning selective sum-product networks. In:
Proceedings of the 31st International Conference on Machine Learning, vol. 32
(2014)
14. Peharz, R., Tschiatschek, S., Pernkopf, F., Domingos, P.: On Theoretical Properties
of Sum-Product Networks. In: Proceedings of the 18th International Conference
on Artificial Intelligence and Statistics (AISTATS), vol. 38, pp. 744–752 (2015)
15. Poon, H., Domingos, P.: Sum product networks: a new deep architecture. In: 2011
IEEE International Conference on Computer Vision Workshops (2011)
16. Pronobis, A., Rao, R.P.: Learning deep generative spatial models for mobile robots.
In: IEEE International Conference on Intelligent Robots and Systems, pp. 755–762
(2017)
17. Sguerra, B.M., Cozman, F.G.: Image classification using sum-product networks for
autonomous flight of micro aerial vehicles. In: 2016 5th Brazilian Conference on
Intelligent Systems (BRACIS), pp. 139–144 (2016)
18. Wang, J., Wang, G.: Hierarchical spatial sum-product networks for action recog-
nition in still images. IEEE Trans. Circuits Syst. Video Technol. 28(1), 90–100
(2018)
Learning Models over Relational Data: A Brief Tutorial

Maximilian Schleich¹, Dan Olteanu¹(B), Mahmoud Abo-Khamis², Hung Q. Ngo², and XuanLong Nguyen³

¹ University of Oxford, Oxford, UK
[email protected]
² RelationalAI, Inc., Berkeley, USA
³ University of Michigan, Ann Arbor, USA
https://fdbresearch.github.io, https://www.relational.ai

Abstract. This tutorial overviews the state of the art in learning mod-
els over relational databases and makes the case for a first-principles
approach that exploits recent developments in database research.
The input to learning classification and regression models is a training
dataset defined by feature extraction queries over relational databases.
The mainstream approach to learning over relational data is to materi-
alize the training dataset, export it out of the database, and then learn
over it using a statistical package. This approach can be expensive as it
requires the materialization of the training dataset. An alternative app-
roach is to cast the machine learning problem as a database problem by
transforming the data-intensive component of the learning task into a
batch of aggregates over the feature extraction query and by computing
this batch directly over the input database.
The tutorial highlights a variety of techniques developed by the
database theory and systems communities to improve the performance of
the learning task. They rely on structural properties of the relational data
and of the feature extraction query, including algebraic (semi-ring), com-
binatorial (hypertree width), statistical (sampling), or geometric (dis-
tance) structure. They also rely on factorized computation, code special-
ization, query compilation, and parallelization.

Keywords: Relational learning · Query processing

1 The Next Big Opportunity

Machine learning is emerging as a general-purpose technology, just as computing
became general-purpose 70 years ago. A core ability of intelligence is the ability
to predict, that is, to turn the information we have into the information we
need. Over the last decade, significant progress has been made on improving

This project has received funding from the European Union’s Horizon 2020 research
and innovation programme under grant agreement No. 682588.
© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 423–432, 2019.
https://doi.org/10.1007/978-3-030-35514-2_32
424 M. Schleich et al.

the quality of prediction by techniques that identify relevant features and by


decreasing the cost of prediction using more performant hardware.
According to a 2017 Kaggle survey on the state of data science and machine
learning among 16,000 machine learning practitioners [26], the majority of prac-
tical data science tasks involve relational data: in retail, 86% of used data is
relational; in insurance, it is 83%; in marketing, it is 82%; while in finance it
is 77%. This is not surprising. The relational model is the jewel in the data
management crown. It is one of the most successful Computer Science stories.
Since its inception in 1969, it has seen a massive adoption in practice. Relational
data benefit from the investment of many human hours for curation and normal-
ization and are rich with knowledge of the underlying domain modelled using
database constraints.
Yet the current state of affairs in building predictive models over relational
data largely ignores the structure and rich semantics readily available in rela-
tional databases. Current machine learning technology throws away this rela-
tional structure and works on one large training dataset that is constructed
separately using queries over relational databases.
This tutorial overviews ongoing efforts by the database theory and systems
communities to address the challenge of efficiently learning machine learning
models over relational databases. It invariably only highlights some of the
representative contributions towards this challenge, with an emphasis on recent
contributions by the authors.
use arrays of GPUs or compute farms for efficient machine learning. It instead
puts forward the insight that an array of known and novel database optimization
and processing techniques can make feasible a wide range of analytics workloads
already on one commodity machine. There is still much to explore in the case
of one machine before turning to compute farms. A key practical benefit of this
line of work is energy-efficient, inexpensive analytics over large databases.
The organization of the tutorial follows the structure of the next sections.

2 Overview of Main Approaches to Machine Learning over Relational Databases
The approaches highlighted in this tutorial are classified depending on how
tightly they integrate the data system, where the input data reside and the train-
ing dataset is constructed, and the machine learning library (statistical software
package), which casts the model training problem as an optimization problem.

2.1 No Integration of Databases and Machine Learning


By far the most common approach to learning over relational data is to use two
distinct systems, that is, the data system for managing the training dataset and
the ML library for model training. These two systems are thus distinct tools on
the technology stack with no integration between the two. The data system first
computes the training dataset as the result of a feature extraction query and
Learning Models over Relational Data 425

exports it as one table commonly in CSV or binary format. The ML library then
imports the training dataset in its own format and learns the desired model.
For the first step, it is common to use open source database management
systems, such as PostgreSQL or SparkSQL [57], or query processing libraries,
such as Python Pandas [33] and R dplyr [56]. Common examples for ML libraries
include scikit-learn [44], R [46], TensorFlow [1], and MLlib [34].
One advantage is the delegation of concerns: Database systems are used to
deal with data, whereas statistical packages are for learning models. Using this
approach, one can learn virtually any model over any database.
The key disadvantage is the non-trivial time spent on materializing, export-
ing, and importing the training dataset, which is commonly orders of magnitude
larger than the input database. Even though the ML libraries are much less
scalable than the data systems, in this approach they are thus expected to work
on much larger inputs. Furthermore, these solutions inherit the limitations of
both of their underlying systems, e.g., the maximum data frame size in R and
the maximum number of columns in PostgreSQL are much less than typical
database sizes and respectively number of model features.

2.2 Loose Integration of Databases and Machine Learning

A second approach is based on a loose integration of the two systems, with code
of the statistical package migrated inside the database system space. In this
approach, each machine learning task is implemented as a distinct user-defined
aggregate function (UDAF) inside the database system. For instance, there are
distinct UDAFs for learning: logistic regression models, linear regression models,
k-means, Principal Component Analysis, and so on. Each of these UDAFs is
registered in the underlying database system, and there is a keyword in the query
language supported by the database system to invoke them. The benefit is the
direct interface between the two systems, with one single process running for
both the construction of the training dataset and learning. The database system
computes one table, which is the training dataset, and the learning task works
directly on it. A prime example of this approach is MADlib [23], which extends
PostgreSQL with a comprehensive library of machine learning UDAFs. The key
advantage of this approach over the previous one is better runtime performance,
since it does not need to export and import the (usually large) training dataset.
Nevertheless, one has to explicitly write a UDAF for each new model and opti-
mization method, essentially redoing the large implementation effort behind well-
established statistical libraries. Approaches discussed in the next sections also
suffer from this limitation, yet some contribute novel learning algorithms that
can be asymptotically faster than existing off-the-shelf ones.
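The UDAF idea can be illustrated with SQLite's user-defined aggregates standing in for PostgreSQL and MADlib (the aggregate below fits a one-variable least-squares slope; all names and the schema are illustrative):

```python
import sqlite3

# Loose-integration sketch: the learning step runs inside the database as
# a user-defined aggregate, so the training dataset never leaves the engine.
class SlopeUDAF:
    def __init__(self):
        self.sxy = 0.0
        self.sxx = 0.0
    def step(self, x, y):            # called once per row by the engine
        self.sxy += x * y
        self.sxx += x * x
    def finalize(self):              # called once, after the last row
        return self.sxy / self.sxx

con = sqlite3.connect(":memory:")
con.create_aggregate("fit_slope", 2, SlopeUDAF)
con.executescript("""
    CREATE TABLE train(x REAL, y REAL);
    INSERT INTO train VALUES (1, 2), (2, 4), (3, 6);
""")
(slope,) = con.execute("SELECT fit_slope(x, y) FROM train").fetchone()
print(slope)  # 2.0: the model is trained by one query, with no export step
```

The single SQL invocation replaces the materialise-export-import round trip of the previous approach, which is exactly the runtime advantage claimed for systems like MADlib.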
A variation of the second approach provides a unified programming archi-
tecture, one framework for many machine learning tasks instead of one distinct
UDAF per task, with possible code reuse across UDAFs. A prime example of this
approach is Bismarck [16], a system that supports incremental (stochastic)
gradient descent for convex programming. Its drawback is that its code may be less

Fig. 1. Structure-aware versus structure-agnostic learning over relational databases.

efficient than the specialized UDAFs. Code reuse across various models and opti-
mization problems may however speed up the development of new functionalities
such as new models and optimization algorithms.

2.3 Tight Integration of Databases and Machine Learning

The aforementioned approaches do not exploit the structure of the data residing
in the database. The next and final approach features a tight integration of the
data and learning systems. The UDAF for the machine learning task is pushed
into the feature extraction query and one single evaluation plan is created to
compute both of them. This approach enables database optimizations such as
pushing parts of the UDAFs past the joins of the feature extraction query. Prime
examples are Orion [29], which supports generalized linear models, Hamlet [30],
which supports logistic regression and naı̈ve Bayes, Morpheus [11], which supports
linear and logistic regression, k-means clustering, and Gaussian non-negative matrix
factorization, F [40,41,51], which supports ridge linear regression, AC/DC [3],
which supports polynomial regression and factorization machines [47–49], and
LMFAO [50], which supports a larger class of models including the previously
mentioned ones and decision trees [10], Chow-Liu trees [12], mutual information,
and data cubes [19,22].

3 Structure-Aware Learning

The tightly-integrated systems F [51], AC/DC [3], and LMFAO [50] are data
structure-aware in that they exploit the structure and sparsity of the database
to lower the complexity and drastically improve the runtime performance of the
learning process.

Learning Models over Relational Data 427

In contrast, we call all the other systems structure-agnostic, since they do not
exploit properties of the input database. Figure 1 depicts the
difference between structure-aware (in green) and structure-agnostic (in red)
approaches. The structure-aware systems compile the model specification into
a set of aggregates, one per feature or feature interaction. This is called model
reformulation in the figure. Data dependencies such as functional dependencies
can be used to reparameterize the model, so a model over a smaller set of func-
tionally determining features is learned instead and then mapped back to the
original model. Join dependencies, such as those prevalent in feature extraction
queries that put together several input tables, are exploited to avoid redundancy
in the representation of join results and push the model aggregates past joins.
The model aggregates over the feature extraction query define a batch of queries.
In practice, for training datasets with tens of features, query batch sizes can be
in the order of: hundreds to thousands for ridge linear regression; thousands for
computing a decision tree node; and tens for an assignment step in k-means clus-
tering [50]. The result of a query batch is then the input to an optimizer such as
a gradient descent method that iterates until the model parameters converge.
Structure-aware methods have been developed (or are being developed) for a
variety of models [4]. Besides those mentioned above, powerful models that can
be supported are: Principal Component Analysis (PCA) [35], Support Vector
Machines (SVM) [25], Sum Product Networks (SPN) [45], random forests, boost-
ing regression trees, and AdaBoost. Newer methods also look at linear algebra
programs where matrices admit a database interpretation such as the results of
queries over relations. In particular, on-going work [17,24] tackles various matrix
decompositions, such as QR, Cholesky, SVD [18], and low-rank [54].
Structure-aware methods call for new data processing techniques to deal with
large query batches. Recent work puts forward new optimization and evaluation
strategies that go beyond the capabilities of existing database management sys-
tems. Recent experiments confirm this observation: Whereas existing query pro-
cessing techniques are mature at executing one query, they miss opportunities
for systematically sharing computation across several queries in a batch [50].
Tightly-integrated DB-ML systems commonly exploit four types of structure:
algebraic, combinatorial, statistical, and geometric.

Algebraic Structure. The algebraic structure of semi-rings underlies the recent
work on factorized databases [41,42]. The distributivity law in particular makes it
possible to factor out data blocks common to several tuples, represent them once,
and compute over them once. Using factorization, relations can be represented more
succinctly as directed acyclic graphs. For instance, the natural join of two rela-
tions is a union of Cartesian products. Instead of representing such a Cartesian
product of two relation parts explicitly as done by relational database systems,
we can represent it symbolically as a tree whose root is the Cartesian prod-
uct symbol and has as children the two relation parts. It has been shown that
factorization can improve the performance of joins [42], aggregates [6,9], and
more recently machine learning [2,4,41,51]. The additive inverse of rings makes it
possible to treat data updates (inserts and deletes) uniformly and enables incremental
maintenance of models learned over relational data [27,28,39]. The sum-product
abstraction in (semi-)rings makes it possible to use the same processing (computing and
maintaining) mechanism for seemingly disparate tasks, such as database queries,
covariance matrices, inference in probabilistic graphical models, and matrix chain
multiplication [6,39]. The efficient maintenance of covariance matrices is a pre-
requisite for the availability of fresh models under data changes [39]. A recent
tutorial overviews advances in incremental view maintenance [15].
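As a toy illustration of the distributivity argument (the relations, values, and function names below are made up for exposition, not taken from the cited systems), an aggregate over a join can be computed from per-key partial sums without ever materializing the join result:

```python
from collections import defaultdict

# Two relations joined on a key; data is illustrative.
R = [("a", 1.0), ("a", 2.0), ("b", 3.0)]      # tuples (key, x)
S = [("a", 10.0), ("b", 20.0), ("b", 30.0)]   # tuples (key, y)

def sum_product_materialized(R, S):
    # Structure-agnostic: enumerate the join result, then aggregate.
    return sum(x * y for (k1, x) in R for (k2, y) in S if k1 == k2)

def sum_product_factorized(R, S):
    # Structure-aware: by distributivity, SUM(x*y) over the join equals,
    # per key, (sum of x in R) * (sum of y in S) -- no join is built.
    sx, sy = defaultdict(float), defaultdict(float)
    for k, x in R:
        sx[k] += x
    for k, y in S:
        sy[k] += y
    return sum(sx[k] * sy[k] for k in sx.keys() & sy.keys())

print(sum_product_materialized(R, S))  # 180.0
print(sum_product_factorized(R, S))    # 180.0
```

The factorized variant touches each input tuple once and combines per-key partial sums, which is the kind of aggregate pushdown past joins that the systems above exploit.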

Combinatorial Structure. The combinatorial structure prevalent in relational
data has been formalized by notions such as width and data degree measures.
If a feature extraction query has width w, then its data complexity is Õ(N^w)
for a database of size N, where Õ hides logarithmic factors in N. Various width
measures have been proposed recently, such as: the fractional edge cover num-
ber [8,20,37,38,55] to capture the asymptotic size of the results for join queries
and the time to compute them; the fractional hypertree width [32] and the sub-
modular width [7] to capture the time to compute Boolean conjunctive queries;
the factorization width [42] to capture the size of the factorized results of con-
junctive queries; the FAQ-width [6] that extends the factorization width from
conjunctive queries to functional aggregate queries; and the sharp-submodular
width [2] that improves on the previous widths for functional aggregate queries.
The degree information captures the number of occurrences of a data value
in the input database [38]. Existing processing techniques adapt depending on
the high or low degree of data values. A recent such technique has been shown
to be worst-case optimal for incrementally maintaining the count of triangles in
a graph [27]. Another such technique achieves a low complexity for computing
queries with negated relations of bounded degree [5]. A special form of bounded
degree is given by functional dependencies, which can be used to reparameterize
(polynomial regression and factorization machine) models and learn simpler,
equivalent models instead [4].

Statistical Structure. The statistical structure makes it possible to sample through joins,
such as the ripple joins [21] and the wander joins [31], and to sample for spe-
cific classes of machine learning models [43]. Sampling is employed whenever
the input database is too large to be processed within a given time budget. It
may nevertheless lead to approximation of both steps in the end-to-end learning
task, from the computation of the feature extraction query to the subsequent
optimization task that yields the desired model. Work in this space quantifies
the loss in accuracy of the obtained model due to sampling.

Geometric Structure. Algorithms for clustering methods such as k-means [35]
can exploit distance measures (such as the optimal transport distance between
two probability measures) to obtain constant-factor approximations for the k-
means objective by clustering over a small grid coreset instead of the full result
of the feature extraction query [14].

4 Database Systems Considerations


Besides exploiting the structure of the input data and the learning task, the
problem of learning models over databases can also benefit tremendously from
database system techniques. Recent work [50] showed non-trivial speedups (sev-
eral orders of magnitude) brought by code optimization for machine learning
workloads over state-of-the-art systems such as TensorFlow [1], R [46], Scikit-
learn [44], and mlpack [13]. Prime examples of code optimizations leading to
such performance improvements include:

Code Specialization and Query Compilation. It involves generating code specific
to the query and the schema of its input data, following prior work [36,52,53],
and also specific to the model to be learned. This technique improves the runtime
performance by inlining code and improving cache locality for the hot data path.

Sharing Computation. Sharing is best achieved by decomposing the aggregates
in a query batch into simple views that are pushed down the join tree of the
feature extraction query. Different aggregates may then need the same simple
views at some nodes in the join tree. Sharing of scans of the input relations can
also happen across views, even when they have different output schemas.

Parallelization. Parallelization can exploit multi-core CPU architectures but
also large share-nothing distributed systems. It comprises both task parallelism,
which identifies subqueries that are independent and can be computed in paral-
lel, and domain parallelism, which partitions relations and computes the same
subqueries over different parts in parallel.

This tutorial is a call to arms for more sustained and principled work on the
theory and systems of structure-aware approaches to data analytics. What are the
theoretical limits of structure-aware learning? What are the classes of machine
learning models that can benefit from structure-aware learning over relational
data? What other types of structure can benefit learning over relational data?

References
1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI,
pp. 265–283 (2016)
2. Abo Khamis, M., et al.: On functional aggregate queries with additive inequalities.
In: PODS, pp. 414–431 (2019)
3. Abo Khamis, M., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: AC/DC: In-
database learning thunderstruck. In: DEEM, pp. 8:1–8:10 (2018)
4. Abo Khamis, M., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: In-database
learning with sparse tensors. In: PODS, pp. 325–340 (2018)
5. Abo Khamis, M., Ngo, H.Q., Olteanu, D., Suciu, D.: Boolean tensor decomposition
for conjunctive queries with negation. In: ICDT, pp. 21:1–21:19 (2019)
6. Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In:
PODS, pp. 13–28 (2016)
7. Abo Khamis, M., Ngo, H.Q., Suciu, D.: What do shannon-type inequalities, sub-
modular width, and disjunctive datalog have to do with one another? In: PODS,
pp. 429–444 (2017)
8. Atserias, A., Grohe, M., Marx, D.: Size bounds and query plans for relational joins.
In: FOCS, pp. 739–748 (2008)
9. Bakibayev, N., Kociský, T., Olteanu, D., Závodný, J.: Aggregation and ordering
in factorised databases. PVLDB 6(14), 1990–2001 (2013)
10. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression
Trees. Wadsworth and Brooks, Monterey (1984)
11. Chen, L., Kumar, A., Naughton, J.F., Patel, J.M.: Towards linear algebra over
normalized data. PVLDB 10(11), 1214–1225 (2017)
12. Chow, C., Liu, C.: Approximating discrete probability distributions with depen-
dence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)
13. Curtin, R.R., Edel, M., Lozhnikov, M., Mentekidis, Y., Ghaisas, S., Zhang, S.:
mlpack 3: a fast, flexible machine learning library. J. Open Source Soft. 3, 726
(2018)
14. Curtin, R.R., Moseley, B., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: Rk-
means: fast coreset construction for clustering relational data (2019)
15. Elghandour, I., Kara, A., Olteanu, D., Vansummeren, S.: Incremental techniques
for large-scale dynamic query processing. In: CIKM, pp. 2297–2298 (2018). Tutorial
16. Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-
RDBMS analytics. In: SIGMOD, pp. 325–336 (2012)
17. van Geffen, B.: QR decomposition of normalised relational data, MSc thesis,
University of Oxford (2018)
18. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. The Johns Hopkins
University Press, Baltimore (2013)
19. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational aggre-
gation operator generalizing group-by, cross-tab, and sub-total. In: ICDE, pp. 152–
159 (1996)
20. Grohe, M., Marx, D.: Constraint solving via fractional edge covers. In: SODA, pp.
289–298 (2006)
21. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp.
287–298 (1999)
22. Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing data cubes effi-
ciently. In: SIGMOD, pp. 205–216 (1996)
23. Hellerstein, J.M., et al.: The MADlib analytics library or MAD skills, the SQL.
PVLDB 5(12), 1700–1711 (2012)
24. Inelus, G.R.: Quadratically Regularised Principal Component Analysis over multi-
relational databases, MSc thesis, University of Oxford (2019)
25. Joachims, T.: Training linear SVMS in linear time. In: SIGKDD, pp. 217–226
(2006)
26. Kaggle: The State of Data Science and Machine Learning (2017). https://www.
kaggle.com/surveys/2017
27. Kara, A., Ngo, H.Q., Nikolic, M., Olteanu, D., Zhang, H.: Counting triangles under
updates in worst-case optimal time. In: ICDT, pp. 4:1–4:18 (2019)
28. Koch, C., Ahmad, Y., Kennedy, O., Nikolic, M., Nötzli, A., Lupei, D., Shaikhha,
A.: Dbtoaster: higher-order delta processing for dynamic, frequently fresh views.
VLDB J. 23(2), 253–278 (2014)
29. Kumar, A., Naughton, J.F., Patel, J.M.: Learning generalized linear models over
normalized data. In: SIGMOD, pp. 1969–1984 (2015)
30. Kumar, A., Naughton, J.F., Patel, J.M., Zhu, X.: To join or not to join?: thinking
twice about joins before feature selection. In: SIGMOD, pp. 19–34 (2016)
31. Li, F., Wu, B., Yi, K., Zhao, Z.: Wander join and XDB: online aggregation via
random walks. ACM Trans. Database Syst. 44(1), 2:1–2:41 (2019)
32. Marx, D.: Approximating fractional hypertree width. ACM Trans. Algorithms 6(2),
29:1–29:17 (2010)
33. McKinney, W.: pandas: a foundational python library for data analysis and statis-
tics. Python High Perform. Sci. Comput. 14 (2011)
34. Meng, X., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res.
17(1), 1235–1241 (2016)
35. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cam-
bridge (2013)
36. Neumann, T.: Efficiently compiling efficient query plans for modern hardware.
PVLDB 4(9), 539–550 (2011)
37. Ngo, H.Q., Porat, E., Ré, C., Rudra, A.: Worst-case optimal join algorithms. In:
PODS, pp. 37–48 (2012)
38. Ngo, H.Q., Ré, C., Rudra, A.: Skew strikes back: New developments in the theory
of join algorithms. In: SIGMOD Rec., pp. 5–16 (2013)
39. Nikolic, M., Olteanu, D.: Incremental view maintenance with triple lock factoriza-
tion benefits. In: SIGMOD, pp. 365–380 (2018)
40. Olteanu, D., Schleich, M.: F: regression models over factorized views. PVLDB
9(10), 1573–1576 (2016)
41. Olteanu, D., Schleich, M.: Factorized databases. SIGMOD Rec. 45(2), 5–16 (2016)
42. Olteanu, D., Závodný, J.: Size bounds for factorised representations of query
results. TODS 40(1), 2 (2015)
43. Park, Y., Qing, J., Shen, X., Mozafari, B.: Blinkml: efficient maximum likelihood
estimation with probabilistic guarantees. In: SIGMOD, pp. 1135–1152 (2019)
44. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
45. Poon, H., Domingos, P.M.: Sum-product networks: a new deep architecture. In:
UAI, pp. 337–346 (2011)
46. R Core Team: R: A Language and Environment for Statistical Computing. R Foun-
dation for Stat. Comp. (2013). www.r-project.org
47. Rendle, S.: Factorization machines. In: Proceedings of the 2010 IEEE International
Conference on Data Mining. ICDM 2010, pp. 995–1000. IEEE Computer Society,
Washington, DC (2010)
48. Rendle, S.: Factorization machines with libFM. ACM Trans. Intell. Syst. Technol.
3(3), 57:1–57:22 (2012)
49. Rendle, S.: Scaling factorization machines to relational data. PVLDB 6(5), 337–348
(2013)
50. Schleich, M., Olteanu, D., Abo Khamis, M., Ngo, H.Q., Nguyen, X.: A layered
aggregate engine for analytics workloads. In: SIGMOD, pp. 1642–1659 (2019)
51. Schleich, M., Olteanu, D., Ciucanu, R.: Learning linear regression models over
factorized joins. In: SIGMOD, pp. 3–18 (2016)
52. Shaikhha, A., Klonatos, Y., Koch, C.: Building efficient query engines in a high-
level language. TODS 43(1), 4:1–4:45 (2018)
53. Shaikhha, A., Klonatos, Y., Parreaux, L., Brown, L., Dashti, M., Koch, C.: How
to architect a query compiler. In: SIGMOD, pp. 1907–1922 (2016)
54. Udell, M., Horn, C., Zadeh, R., Boyd, S.: Generalized low rank models. Found.
Trends Mach. Learn. 9(1), 1–118 (2016)
55. Veldhuizen, T.L.: Triejoin: a simple, worst-case optimal join algorithm. In: ICDT,
pp. 96–106 (2014)
56. Wickham, H., Francois, R., Henry, L., Müller, K., et al.: dplyr: a grammar of data
manipulation. R package version 0.4 3 (2015)
57. Zaharia, M., Chowdhury, M., et al.: Resilient distributed datasets: a fault-tolerant
abstraction for in-memory cluster computing. In: NSDI, p. 2 (2012)
Subspace Clustering and Some Soft
Variants

Marie-Jeanne Lesot(B)

Sorbonne Université, CNRS, LIP6, 75005 Paris, France


[email protected]

Abstract. Subspace clustering is an unsupervised machine learning task
that, as clustering, decomposes a data set into subgroups that are both
distinct and compact, and that, in addition, explicitly takes into account
the fact that the data subgroups live in different subspaces of the feature
space. This paper provides a brief survey of the main approaches that
have been proposed to address this task, distinguishing between the two
paradigms used in the literature: the first one builds a local similarity
matrix to extract more appropriate data subgroups, whereas the second
one explicitly identifies the subspaces, so as to provide more complete
information about the clusters. It then focuses on soft computing
approaches, that in particular exploit the framework of the fuzzy set the-
ory to identify both the data subgroups and their associated subspaces.

Keywords: Machine learning · Unsupervised learning · Subspace clustering · Soft computing · Fuzzy logic

1 Introduction

In the unsupervised learning framework, the only available input is a set of data,
here considered to be numerically described by feature vectors. The aim is then
to extract information from the data, e.g. in the form of linguistic summaries
(see e.g. [17]), frequent value co-occurrences, as expressed by association rules, or
as clusters. The latter are subgroups of data that are both compact and distinct,
which means that any data point is more similar to points assigned to the same
group than to points assigned to other groups. These clusters provide insight to
the data structure and a summary of the data set.
Subspace clustering [3,28] is a refined form of the clustering task, where the
clusters are assumed to live in different subspaces of the feature space: on the
one hand, this assumption can help identify more relevant data subgroups,
relaxing the need to use a single, global, similarity relation; on the other hand,
it leads to refining the identified data summaries, so as to characterise each cluster
through its associated subspace. These two points of view have led to the two
main families of subspace clustering approaches, that have slightly different aims
and definitions.

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 433–443, 2019.
https://doi.org/10.1007/978-3-030-35514-2_33
434 M.-J. Lesot

This paper first discusses in more detail the definition of the subspace clus-
tering task in Sect. 2 and presents in turn the two main paradigms, in Sects. 3
and 4 respectively. It then focuses on soft computing approaches that have been
proposed to perform subspace clustering, in particular fuzzy ones: fuzzy logic
tools have proved to be useful to all types of machine learning tasks, such as
classification, extraction of association rules or clustering. Section 5 describes
their applications to the case of subspace clustering. Section 6 concludes the
paper.

2 Subspace Clustering Task Definition


Clustering. Clustering aims at decomposing a data set into subgroups that are
both compact and separable: compactness imposes a high internal similarity for
points assigned to the same cluster; separability imposes a high dissimilarity for
points assigned to different clusters, so that the clusters are distinct one from
another. These two properties thus jointly justify the individual existence of each
of the extracted clusters.
There exist many approaches to address this task that can broadly be struc-
tured into five main families: hierarchical, partitioning, density-based, spectral
and, more recently, deep approaches. In a nutshell, hierarchical clustering identi-
fies multiple data partitions, represented in a tree structure called dendrogram,
that makes it possible to vary the desired granularity level of the data decomposition into
subgroups. Partitioning approaches, that provide numerous variants to the semi-
nal k-means method, optimise a cost function that can be interpreted as a quan-
tisation error, i.e. assessing the approximation error when a data point is repre-
sented by the centre of the cluster it is assigned to. Density-based approaches,
exemplified by dbscan [9], group points according to a transitive neighbour rela-
tion and define cluster boundaries as low density regions. Spectral methods [23]
rely on diagonalising the pairwise similarity matrix, considering that two points
should be assigned to the same cluster if they have the same similarity profile to
the other points. They can also be interpreted as identifying connected components
in the similarity graph, whose nodes correspond to the data points and edges
are weighted by the pairwise similarity values. Deep clustering approaches [24]
are often based on an encoder-decoder architecture, where the encoder provides
a low-dimensional representation of the data, corresponding to the cluster repre-
sentation, from which the data must be reconstructed in the decoding phase.

Subspace Clustering. Subspace clustering refines the clustering task by
considering that each cluster lives in its own subspace of the whole feature space.
Among others, this assumption implies that there is not a single, global, distance
(or similarity) measure to compare the data points, defined in the whole feature
space: each cluster can be associated to its own, local, comparison measure,
defined in its corresponding subspace.
Subspace clustering cannot be addressed by performing local feature selec-
tion for each cluster: such an approach would first identify the clusters in the
global feature space before characterising them. In contrast, subspace clustering aims at
extracting more appropriate clusters that can be identified in lower dimensional
spaces only. Reciprocally, first performing feature selection and then clustering
the data would impose a subspace common to all clusters. Subspace cluster-
ing addresses both subgroup and feature identification simultaneously, so as to
obtain better subgroups, defined locally.
Subspace clustering is especially useful for high dimensional data, due to the
curse of dimensionality that makes all distances between pairs of points have
very close values: it can be the case that there exists no dense data subgroup in
the whole feature space and that clusters can only be identified when considering
subspaces with lower dimensionality.

Two Main Paradigms. Numerous approaches for subspace clustering have
been proposed in the literature, offering a diversity similar to that of the general
clustering task. Two main paradigms can be distinguished, that focus on slightly
different objectives and lead to independent method developments, as sketched
below and described in more details in the next two sections.
The first category, presented in Sect. 3, exploits the hypothesis that the avail-
able data has been drawn from distinct subspaces so as to improve the clustering
results: the subspace existence is viewed as a useful intermediary tool for the clus-
tering aim, but their identification is not a goal in itself and they are not further
used. Methods in this category rely on deriving an affinity matrix from the data,
that captures local similarity between data points and can be used to cluster the
data, instead of a predefined global similarity (or distance) measure.
The second category, described in Sect. 4, considers that the subspaces in
which the clusters live provide useful information in themselves: methods in this
category aim at explicitly identifying these subspaces, so as to characterise the
identified clusters and extract more knowledge from the data. Methods in this
category rely on predefined forms of the subspaces and extract from the data
their optimal instantiation.

3 Learning an Affinity Matrix

As mentioned in the previous section, subspace clustering can be defined as
a clustering task in the case where the data have been drawn from a union
of low dimensional subspaces [15,28]. Such a priori knowledge is for instance
available in many computer vision applications, such as image segmentation,
motion segmentation or image clustering [7,15,18,22] to name a few.
Methods in this category rely on a two-step procedure that consists in first
learning a local affinity matrix and then performing spectral clustering on this
matrix: the affinity matrix learns local similarity (or distance) values for each
couple of data, instead of applying a global, predefined, measure.
Among others [15,28], self-expressive approaches first represent the data
points as linear combinations of other data points, that must thus be in the
same subspace. More formally, they learn a self-representation matrix C,
minimising the reconstruction cost ‖X − XC‖_F , where ‖·‖_F is the Frobenius norm.
An affinity matrix can then be defined as W = (1/2)(|C| + |C^T |) and used for
spectral clustering. Various constraints on the self-representation matrix C can be
considered, adding penalisation terms to the reconstruction cost with various
norms. Some examples include Sparse Subspace Clustering, SSC [7,8], or Low-
Rank Representation, LRR [22]. In order to extend to the case of non-linear
subspaces, kernel approaches have been proposed [26,30,31], as well as, more
recently, deep learning methods that do not require setting a priori the consid-
ered non-linear data transformation. The latter can consist in applying SSC
to the latent features extracted by an auto-encoder architecture [27] or, more
intrinsically, to integrate a self-expressive layer between the encoder and decoder
steps [15,32].
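As a rough sketch of the self-expressive step, the snippet below uses a ridge penalty in place of the sparse or low-rank norms of SSC and LRR (toy data and an illustrative function name; this is not the algorithm of any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: columns are points drawn from two 1-D subspaces (lines) in R^3.
b1, b2 = rng.normal(size=(3, 1)), rng.normal(size=(3, 1))
X = np.hstack([b1 @ rng.normal(size=(1, 5)), b2 @ rng.normal(size=(1, 5))])

def self_expression(X, lam=1e-2):
    """For each point x_i, solve min_c ||x_i - X c||^2 + lam ||c||^2
    with the constraint c_i = 0, so that each point is expressed as a
    combination of the *other* points."""
    n = X.shape[1]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        A = X[:, others]
        c = np.linalg.solve(A.T @ A + lam * np.eye(n - 1), A.T @ X[:, i])
        C[others, i] = c
    return C

C = self_expression(X)
W = 0.5 * (np.abs(C) + np.abs(C.T))  # symmetric affinity, as in the text
# W would then be fed to spectral clustering; points lying in the same
# subspace tend to carry the largest mutual affinities.
```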

4 Identifying the Subspaces

A second approach to subspace clustering considers that the subspaces in which
the clusters live are also interesting as such and provide useful insight to the
data structure. It thus provides an explicit representation of these subspaces,
whereas the methods described in the previous section only exploit the matrix
of the local distances between all data pairs.
Many approaches have been proposed to identify the subspaces associated with
each cluster; they can be organised into three categories discussed in turn below.
They also differ by the form of the subspaces they consider that can be hyper-
rectangles [3], vector subspaces [1,28] or hyperplanes of low dimension [29], to
name a few.

Bottom-Up Strategy. A first category of approaches starts from atomic
clusters with high density and very low dimensionality that are then iteratively fused
to build more complex clusters and subspaces.
This is for instance the case of the clique algorithm [3] that starts from
dense unit cubes, progressively combined to define clusters and subspaces as
maximal sets of adjacent cells, parallel to the axes. A final step provides a textual
description of each cluster that aims at being both concise and informative for
the user: it contains information about the involved dimensions and their value
boundaries, offering an enriched result as compared to the list of the points
assigned to the considered cluster.
Enclus [6] and mafia [5] follow the same bottom-up principle. The dif-
ferences come from the fact that enclus minimises the cell entropy instead of
maximising their densities and that mafia allows the use of an adaptive grid to
define the initial cube units according to the data distribution for each attribute.
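The bottom-up idea can be sketched in one dimension: map points to unit grid cells, keep the dense cells, and fuse runs of adjacent dense cells into maximal intervals. This is only a toy reduction of clique's grid logic (thresholds and data are made up):

```python
import math
from collections import Counter

def dense_cells(values, cell_width=1.0, density_threshold=2):
    # Map 1-D values to grid cells and keep those holding enough points:
    # the dense atomic units the bottom-up strategy starts from.
    counts = Counter(math.floor(v / cell_width) for v in values)
    return sorted(c for c, n in counts.items() if n >= density_threshold)

def merge_adjacent(cells):
    # Fuse runs of adjacent dense cells into maximal intervals, mirroring
    # the combination of adjacent cells into clusters.
    clusters, run = [], []
    for c in cells:
        if run and c != run[-1] + 1:
            clusters.append((run[0], run[-1] + 1))
            run = []
        run.append(c)
    if run:
        clusters.append((run[0], run[-1] + 1))
    return clusters

points = [0.1, 0.4, 0.7, 1.2, 1.5, 6.3, 6.8, 7.1]
print(merge_adjacent(dense_cells(points)))  # [(0, 2), (6, 7)]
```

In the multi-dimensional case, clique additionally combines dense cells across growing subsets of dimensions, which this 1-D sketch omits.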

Top-Down Strategy: Projected Clustering. A second category of
methods applies a top-down exploration method that progressively refines subspaces
initially defined as the whole feature space. The refinement step consists in pro-
jecting the data to subspaces, making it possible to identify the cluster structure
of the data even when there is no dense clusters in the full feature space.
Proclus [1] belongs to this framework of projected clustering, it identifies
three components: (i) clusters, (ii) associated dimensions that define axes-parallel
subspaces, as well as (iii) outliers, i.e. points that are assigned to none of the
clusters. The candidate projection subspaces are defined by the dimensions along
which the cluster members have the lowest dispersion. Orclus [2] is a variant of
proclus that allows to identify subspaces that are not parallel to the initial axes.

Partitioning Strategy: Optimising a Cost Function. A third category of
methods relies on the definition of a cost function that extends the classical and
seminal k-means cost function so as to integrate the desired subspaces associated
with the clusters. They are based on replacing the Euclidean distance used to
compare the data by weighted variants thereof, where the weights are attached
to each cluster so as to obtain a local definition of the distance function:
a dimension associated with a large weight is interpreted as playing a major
role in the cluster definition, as small variations in this dimension lead to large
increases in the distance value. The subspaces can thus be defined indirectly by
the weights attached to the dimensions. Some approaches impose the subspaces
to be axes-parallel, others allow for rotations.
The subspace clustering methods usually require as hyperparameter the
desired number of clusters, set in advance. Most of them apply an alternated
optimisation scheme, as performed by the k-means algorithm: given candidate
cluster definitions, they optimise the data assignment to the clusters and, given
a candidate assignment, they optimise the cluster description. The latter adds to
the traditional cluster centres the associated distance weights. The approaches
belonging to this category vary by the constraints imposed to these weights, as
illustrated by some examples below.
It can first be observed that, although proposed in another framework, the
Gaussian Mixture Model (GMM) clustering approach can be interpreted as
addressing the subspace clustering task: a GMM associates each cluster with its
covariance matrix and computes the distance between a data point and the
centre of the cluster it is assigned to as a local Mahalanobis distance. As a
consequence, for example, a dimension associated with a low variance can be
interpreted as highly characterising the cluster, and the cluster subspace can
be defined as the one spanned by the dimensions with minimal variances. Full
covariance matrices allow for general cluster subspaces, diagonal matrices impose
the subspaces to be parallel to the initial axes.
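Assuming a diagonal covariance matrix for one cluster (with hand-picked values for illustration), the interpretation above can be sketched as follows: the cluster's local Mahalanobis distance penalises deviations along low-variance dimensions, and those dimensions define its subspace.

```python
import numpy as np

# Illustrative diagonal covariance of one cluster: tight along dimensions
# 0 and 1, spread out along dimensions 2 and 3.
variances = np.array([0.01, 0.02, 4.0, 5.0])
centre = np.zeros(4)

def local_mahalanobis(x, centre, variances):
    # Distance induced by this cluster's own (diagonal) covariance matrix.
    return float(np.sqrt(np.sum((x - centre) ** 2 / variances)))

# The cluster's subspace: the dimensions with minimal variance.
subspace = np.argsort(variances)[:2]
print(subspace)  # [0 1]

# The same absolute deviation costs far more along a characterising dimension.
print(local_mahalanobis(np.array([0.1, 0, 0, 0]), centre, variances))  # ~1.0
print(local_mahalanobis(np.array([0, 0, 0.1, 0]), centre, variances))  # ~0.05
```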
Another example is provided by the Fuzzy Subspace Clustering algorithm,
FSC [10], that, despite its name, provides crisp assignment of the data to the
clusters: it is named fuzzy because of the weights in [0, 1] attached to the
dimensions. Denoting n the number of data points, d the number of features,
xi = (xi1 , · · · , xid ) for i = 1..n the data, c the desired number of clusters,
cr = (cr1 , · · · , crd ) for r = 1..c the cluster centres, uri the assignment of data xi
438 M.-J. Lesot

to cluster r, wr = (wr1 , · · · , wrd ) for r = 1..c the dimension weights for cluster r
and η and q two hyperparameters, FSC considers the cost function
n 
 c d
 c 
 d
q q
JF SC = uri wrp (xip − crp )2 + η wrp (1)
i=1 r=1 p=1 r=1 p=1
under the constraints u_ri ∈ {0, 1}, Σ_{r=1}^{c} u_ri = 1 for all i, and Σ_{p=1}^{d} w_rp = 1 for all r. In this cost, the first term is identical to the k-means cost function with the Euclidean distance replaced by a weighted one; the second term is required so that the update equations are well defined [10]. The two terms are balanced by the η hyperparameter. The first two constraints are identical to the k-means ones; the third one forbids the trivial solution where all weights w_rp = 0. The q hyperparameter, which defines the exponent of the weights w_rp, is similar to the fuzzifier m used in the fuzzy c-means algorithm to avoid converging to binary weights w_rp ∈ {0, 1} [20].
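The alternating scheme for this cost can be sketched in a few lines of NumPy. This is a minimal illustration written from cost function (1), not the reference implementation of [10]; the weight update w_rp ∝ (D_rp + η)^(−1/(q−1)) is derived by hand from the Lagrangian of (1) under the constraint Σ_p w_rp = 1, an assumption to be checked against [10].

```python
import numpy as np

def fsc(X, c, q=2.0, eta=1.0, n_iter=30, seed=0):
    """Alternating optimisation for the FSC cost (1): crisp assignments,
    per-cluster dimension weights that sum to 1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centres = X[rng.choice(n, c, replace=False)].astype(float)
    w = np.full((c, d), 1.0 / d)
    for _ in range(n_iter):
        # assignment step: weighted squared distances, shape (n, c)
        dist = ((X[:, None, :] - centres[None]) ** 2 * w[None] ** q).sum(-1)
        labels = dist.argmin(1)
        for r in range(c):
            pts = X[labels == r]
            if len(pts) == 0:          # keep previous centre if a cluster empties
                continue
            centres[r] = pts.mean(0)
            # D[p]: dispersion of cluster r along dimension p
            D = ((pts - centres[r]) ** 2).sum(0)
            # weight update: w_rp proportional to (D_rp + eta)^(-1/(q-1))
            w[r] = (D + eta) ** (-1.0 / (q - 1.0))
            w[r] /= w[r].sum()
    return labels, centres, w

# Two synthetic clusters, each tight along a different dimension.
rng = np.random.default_rng(1)
a = rng.normal([0, 0], [0.1, 3.0], size=(150, 2))
b = rng.normal([10, 0], [3.0, 0.1], size=(150, 2))
X = np.vstack([a, b])
labels, centres, w = fsc(X, c=2)
print(np.round(w, 2))  # each row typically concentrates on its tight dimension
```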
The Entropy Weighted k-means algorithm, EWKM [16], is an extension that aims at controlling the sparsity of the dimension weights w_rp, so that they tend to equal 0 instead of being small but non-zero: to that aim, it replaces the second term in J_FSC with an entropy regularisation term, balanced by a γ parameter:

$$J_{EWKM} = \sum_{i=1}^{n}\sum_{r=1}^{c} u_{ri} \sum_{p=1}^{d} w_{rp}\,(x_{ip}-c_{rp})^2 \;+\; \gamma \sum_{r=1}^{c}\sum_{p=1}^{d} w_{rp}\log w_{rp} \qquad (2)$$

under the same constraints. As γ tends to 0, the weights concentrate on fewer dimensions; γ thus controls the sparsity level of the w_rp weights.
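Under the entropy regulariser, the weight update has a softmax closed form: minimising Σ_p w_p D_p + γ Σ_p w_p log w_p subject to Σ_p w_p = 1 gives w_p ∝ exp(−D_p/γ). The sketch below is derived by hand from (2) and should be checked against [16]:

```python
import numpy as np

def ewkm_weights(D, gamma):
    """Closed-form dimension weights for one cluster in EWKM.

    D[p] is the cluster's dispersion along dimension p, i.e. the sum
    over its points of (x_ip - c_rp)^2; the solution is a softmax
    of -D/gamma.
    """
    z = -np.asarray(D, dtype=float) / gamma
    z -= z.max()                 # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

D = np.array([0.5, 5.0, 50.0])   # per-dimension dispersion of one cluster
for gamma in (100.0, 5.0, 0.5):
    print(gamma, np.round(ewkm_weights(D, gamma), 3))
# large gamma: near-uniform weights; small gamma: the weight
# concentrates on the least-dispersed dimension (a sparser subspace)
```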
5 Soft Variants
This section details partitioning approaches, introduced in the previous section, in the case where the point assignment to the clusters is soft rather than binary, i.e., using the above notations, u_ri ∈ [0, 1] instead of u_ri ∈ {0, 1}: they constitute subspace extensions of the fuzzy c-means algorithm, fcm, and its variants (see e.g. [14, 21] for overviews).
First, the Gustafson-Kessel algorithm [13] can be viewed as the fuzzy counterpart of the GMM discussed in the previous section: both use the Mahalanobis distance and consider weighted assignments to the clusters. They differ in the interpretation of these weights and in the cost function they consider: GMM optimises the log-likelihood of the data, in a probabilistic modelling framework, whereas Gustafson-Kessel considers a quantisation error, in an fcm manner.
Using the same notations as in the previous section, with the additional hyperparameters m, called the fuzzifier, and (α_r)_{r=1..c} ∈ R, the Attribute Weighted Fuzzy c-means algorithm, AWFCM [19], is based on the cost function

$$J_{AWFCM} = \sum_{i=1}^{n}\sum_{r=1}^{c} u_{ri}^{m} \sum_{p=1}^{d} w_{rp}^{q}\,(x_{ip}-c_{rp})^2 \qquad (3)$$
with u_ri ∈ [0, 1] and under the constraints Σ_{r=1}^{c} u_ri = 1 for all i, Σ_{i=1}^{n} u_ri > 0 for all r, and Σ_{p=1}^{d} w_rp = α_r for all r. The cost function is thus identical to the fcm one, with the Euclidean distance replaced by its weighted variant. The first two constraints are also identical to the fcm ones; the third one forbids the trivial solution w_rp = 0. The (α_r) hyperparameters can also serve to weight the relative importance of the c clusters in the final partition, but they are usually all set to 1 [19].
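The alternating updates for (3) can be sketched as follows. This is an illustration derived by hand from the cost function, with α_r = 1 for all r, not the reference implementation of [19]: the membership update is the usual fcm one with the weighted distance, and the weight update w_rp ∝ D_rp^(−1/(q−1)) follows from the Lagrangian under Σ_p w_rp = 1.

```python
import numpy as np

def awfcm(X, c, m=2.0, q=2.0, n_iter=40, seed=0, eps=1e-9):
    """Alternating updates for the AWFCM cost (3), with alpha_r = 1 for all r."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centres = X[rng.choice(n, c, replace=False)].astype(float)
    w = np.full((c, d), 1.0 / d)
    for _ in range(n_iter):
        # weighted squared distances d_ri, shape (n, c)
        dist = ((X[:, None, :] - centres[None]) ** 2 * w[None] ** q).sum(-1) + eps
        # fcm-style membership update under the constraint sum_r u_ri = 1
        u = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (1.0 / (m - 1.0))).sum(-1)
        um = u ** m
        centres = (um.T @ X) / um.sum(0)[:, None]
        # D[r, p]: membership-weighted dispersion of cluster r along dimension p
        D = np.einsum('ir,irp->rp', um, (X[:, None, :] - centres[None]) ** 2) + eps
        # weight update: w_rp proportional to D_rp^(-1/(q-1)), normalised per cluster
        w = D ** (-1.0 / (q - 1.0))
        w /= w.sum(1, keepdims=True)
    return u, centres, w

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], [0.2, 2.0], size=(100, 2)),
               rng.normal([8, 8], [2.0, 0.2], size=(100, 2))])
u, centres, w = awfcm(X, c=2)
```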
Many variants of AWFCM have been proposed, for instance to introduce sparsity in the subspace description: AWFCM indeed produces solutions where none of the w_rp parameters equals zero, even if they can be very small. This is similar to a well-known effect of fcm, where the optimisation actually leads to u_ri ∈ ]0, 1[, except for data points that coincide with a cluster centre: the membership degrees can be very small, but they cannot equal zero [20]. Borgelt [4] thus proposes to apply the sparsity-inducing constraints introduced for the membership degrees u_ri [20], considering

$$J_{BOR} = \sum_{i=1}^{n}\sum_{r=1}^{c} g(u_{ri}) \sum_{p=1}^{d} g(w_{rp})\,(x_{ip}-c_{rp})^2 \qquad (4)$$

under the same constraints as AWFCM, where

$$g(x) = \frac{1-\beta}{1+\beta}\,x^2 + \frac{2\beta}{1+\beta}\,x \qquad \text{with } \beta \in [0,1[$$

Two different β values can be considered for the membership degrees u_ri and for the dimension weights w_rp respectively: setting β = 0 leads to the same function as AWFCM with m = 2 and q = 2, which are the traditional choices [19]. A non-zero β value makes it possible to obtain u_ri = 0 or w_rp = 0 [20], providing a sparsity property both for the membership degrees and for the dimension weights.
The Weighted Laplacian Fuzzy Clustering algorithm, WLFC [11], is another variant of AWFCM that aims at correcting an observed greediness of this algorithm: AWFCM sometimes appears to be over-efficient and to fail to respect the global geometry of the data, because of its adaptation to local structure [11]. To address this issue, WLFC adds a regularisation term to the cost function, so as to counterbalance the local effect of cluster subspaces:

$$J_{WLFC} = \sum_{i=1}^{n}\sum_{r=1}^{c} u_{ri}^{2} \sum_{p=1}^{d} w_{rp}^{q}\,(x_{ip}-c_{rp})^2 \;+\; \gamma \sum_{i,j=1}^{n}\sum_{r=1}^{c} (u_{ri}-u_{rj})^2\, s_{ij} \qquad (5)$$

under the same constraints as AWFCM. In this cost, s_ij is a well-chosen global similarity measure [11] that imposes that points that are neighbours in the whole feature space still have somewhat similar membership degrees. The γ hyperparameter balances the two effects and prevents discontinuities in the solution among point neighbourhoods.
The Proximal Fuzzy Subspace c-Means algorithm, PFSCM [12], considers the cost function defined as

$$J_{PFSCM} = \sum_{i=1}^{n}\sum_{r=1}^{c} u_{ri}^{m} \sum_{p=1}^{d} w_{rp}^{2}\,(x_{ip}-c_{rp})^2 \;+\; \gamma \sum_{r=1}^{c} \Big|\sum_{p=1}^{d} w_{rp} - 1\Big| \qquad (6)$$

under the first two constraints of AWFCM: the second term can be interpreted as an inline version of the third constraint, which is thus moved into the cost function. As this term is not differentiable, PFSCM proposes an original optimisation scheme that does not rely on standard alternating optimisation but on proximal descent (see e.g. [25]). This algorithm appears to identify the number of relevant dimensions for each cluster better than AWFCM, which tends to underestimate it. Moreover, the proposal to apply proximal optimisation techniques to the clustering task opens the way to a wide range of regularisation terms: it allows for more advanced penalty terms that are not required to be differentiable.
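To make the proximal idea concrete, the proximal operator of the penalty w ↦ λ|Σ_p w_p − 1| can be worked out in closed form: since the penalty only depends on the sum of the coordinates, its prox shifts all coordinates equally, by a soft-thresholded amount. This is only the prox of the penalty term, derived here by hand, not the full PFSCM algorithm of [12].

```python
import numpy as np

def prox_abs_sum(v, lam):
    """Proximal operator of w -> lam * |sum(w) - 1|.

    Solves argmin_w 0.5 * ||w - v||^2 + lam * |sum(w) - 1|.
    The penalty only sees sum(w), so the minimiser shifts every
    coordinate of v by the same amount.
    """
    v = np.asarray(v, dtype=float)
    d = v.size
    excess = v.sum() - 1.0
    if excess > lam * d:          # sum too large: shrink all coordinates
        return v - lam
    if excess < -lam * d:         # sum too small: grow all coordinates
        return v + lam
    return v - excess / d         # otherwise the sum is pulled exactly to 1

print(prox_abs_sum([0.52, 0.52], lam=0.05))  # small excess: sum snapped to 1
print(prox_abs_sum([2.0, 2.0], lam=0.05))    # large excess: shrunk by lam
```

A proximal descent step then alternates a gradient step on the smooth quadratic term with this prox, which is what makes non-differentiable penalties tractable.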
6 Conclusion
This paper proposed a brief overview of the subspace clustering task and the main categories of methods proposed to address it. They differ in their understanding of the general aim and offer a large variety of approaches that provide different types of outputs and of knowledge extracted from the data.
Still, they have several properties in common. First, most of them rely on a non-constant distance measure: the comparison of two data points does not rely on a global measure, but on a local one, which somehow takes the assignment to the same cluster as a parameter defining this measure. As such, subspace clustering constitutes a task that must extract from the data, in an unsupervised way, both compact and distinct data subgroups, as well as the reasons why these subgroups can be considered compact. This makes it clear that subspace clustering is a highly demanding and difficult task, which aims at exploiting inputs with little information (indeed, the inputs reduce to the data positions in the feature space only) to extract very rich knowledge.
Moreover, it can be observed that many subspace clustering methods share a sparsity constraint: it requires the subspaces to be as small as possible while still containing the clusters, without oversimplifying their complexity. A large variety of criteria to define sparsity and to integrate it into the task objective are exploited across the existing approaches.
Among the directions for ongoing work in the subspace clustering domain, a major one deals with the question of evaluation: as is the case for any unsupervised learning task, there is no consensus about the quality criteria to be used to assess the obtained results. The first category of methods, which exploit the existence of subspaces to learn an affinity matrix, usually focuses on evaluating the cluster quality: they resort to general clustering criteria, such as the clustering error measured as accuracy, cluster purity, or Normalised Mutual Information. They are thus often evaluated in a supervised manner, considering a reference of expected data subgroups. When subspace clustering is understood as also characterising the clusters using the subspaces in which they
live, the evaluation must also assess these extracted subspaces, e.g. taking into
account both their adequacy and sparsity. The definition of corresponding quality
criteria still constitutes an open question in the subspace clustering domain.
Acknowledgements. I wish to thank Arthur Guillon and Christophe Marsala with whom I started exploring the domain of subspace clustering.
References

1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 61–72. ACM (1999)
2. Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 70–81. ACM (2000)
3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD, pp. 94–105. ACM (1998)
4. Borgelt, C.: Fuzzy subspace clustering. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 93–103. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-01044-6_8
5. Burdick, D., Calimlim, M., Gehrke, J.: MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the 17th International Conference on Data Engineering, pp. 443–452 (2001)
6. Cheng, C.H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining, pp. 84–93 (1999)
7. Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2790–2797 (2009)
8. Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013)
9. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining, KDD, pp. 226–231 (1996)
10. Gan, G., Wu, J.: A convergence theorem for the fuzzy subspace clustering algorithm. Pattern Recogn. 41(6), 1939–1947 (2008)
11. Guillon, A., Lesot, M.J., Marsala, C.: Laplacian regularization for fuzzy subspace clustering. In: Proceedings of the IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2017 (2017)
12. Guillon, A., Lesot, M.J., Marsala, C.: A proximal framework for fuzzy subspace clustering. Fuzzy Sets Syst. 366, 24–45 (2019)
13. Gustafson, D., Kessel, W.: Fuzzy clustering with a fuzzy covariance matrix. In: Proceedings of the IEEE Conference on Decision and Control, vol. 17, pp. 761–766. IEEE (1978)
14. Höppner, F., Klawonn, F., Kruse, R., Runkler, T.: Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley, New York (1999)
15. Ji, P., Zhang, T., Li, H., Salzmann, M., Reid, I.: Deep subspace clustering networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS (2017)
16. Jing, L., Ng, M.K., Huang, J.Z.: An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans. Knowl. Data Eng. 19(8), 1026–1041 (2007)
17. Kacprzyk, J., Zadrozny, S.: Linguistic database summaries and their protoforms: towards natural language based knowledge discovery tools. Inf. Sci. 173(4), 281–304 (2005)
18. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: Proceedings of the 8th International Conference on Computer Vision, ICCV, vol. 2, pp. 586–591 (2001)
19. Keller, A., Klawonn, F.: Fuzzy clustering with weighting of data variables. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 8(6), 735–746 (2000)
20. Klawonn, F., Höppner, F.: What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier. In: Proceedings of the 5th International Symposium on Intelligent Data Analysis, pp. 254–264 (2003)
21. Kruse, R., Döring, C., Lesot, M.J.: Fundamentals of fuzzy clustering. In: de Oliveira, J., Pedrycz, W. (eds.) Advances in Fuzzy Clustering and its Applications. Wiley, New York (2007)
22. Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 171–184 (2013)
23. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
24. Min, E., Guo, X., Liu, Q., Zhang, G., Cui, J., Lun, J.: A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access 6, 39501–39514 (2018)
25. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2014)
26. Patel, V.M., Vidal, R.: Kernel sparse subspace clustering. In: Proceedings of ICIP, pp. 2849–2853 (2014)
27. Peng, X., Xiao, S., Feng, J., Yau, W.Y., Yi, Z.: Deep subspace clustering with sparsity prior. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence, IJCAI, pp. 1925–1931 (2016)
28. Vidal, R.: A tutorial on subspace clustering. IEEE Sig. Process. Mag. 28(2), 52–68 (2010)
29. Wang, D., Ding, C., Li, T.: K-subspace clustering. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS (LNAI), vol. 5782, pp. 506–521. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04174-7_33
30. Xiao, S., Tan, M., Xu, D., Dong, Z.Y.: Robust kernel low-rank representation. IEEE Trans. Neural Netw. Learn. Syst. 27(11), 2268–2281 (2016)
31. Yin, M., Guo, Y., Gao, J., He, Z., Xie, S.: Kernel sparse subspace clustering on symmetric positive definite manifolds. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5157–5164 (2016)
32. Zhou, L., Bai, X., Wang, D., Liu, X., Zhou, J., Hancock, E.: Latent distribution preserving deep subspace clustering. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI, pp. 4440–4446 (2019)
Invited Keynotes
From Shallow to Deep Interactions Between Knowledge Representation, Reasoning and Machine Learning

Kay R. Amel

GDR "Aspects Formels et Algorithmiques de l'Intelligence Artificielle", CNRS, Gif-sur-Yvette, France
Abstract. Reasoning and learning are two basic concerns at the core of Artificial Intelligence (AI). In the last three decades, Knowledge Representation and Reasoning (KRR) on the one hand and Machine Learning (ML) on the other hand have been considerably developed and have specialised in a large number of dedicated sub-fields. These technical developments and specialisations, while they were strengthening the respective corpora of methods in KRR and in ML, also contributed to an almost complete separation of the lines of research in these two areas, making many researchers on one side largely ignorant of what was going on on the other side.
This state of affairs also somewhat relies on general, overly simplistic dichotomies that suggest great differences between KRR and ML: KRR deals with knowledge, ML handles data; KRR privileges symbolic, discrete approaches, while numerical methods dominate ML. Even if such a rough picture points out things that cannot be fully denied, it is also misleading, as for instance KRR can deal with data as well (e.g., formal concept analysis) and ML approaches may rely on symbolic knowledge (e.g., inductive logic programming). Indeed, the frontier between the two fields is actually much blurrier than it appears, as both share approaches such as Bayesian networks, or case-based reasoning and analogical reasoning, as well as important concerns such as uncertainty representation. In fact, one may well argue that similarities between the two fields are more numerous than one may think.
This talk proposes a tentative and original survey of meeting points between
KRR and ML. Some common concerns are first identified and discussed such as
Kay R. Amel is the pen name of the working group "Apprentissage et Raisonnement" of the GDR ("Groupement De Recherche") "Aspects Formels et Algorithmiques de l'Intelligence Artificielle", CNRS, France (https://www.gdria.fr/presentation/). The contributors to this paper include: Zied Bouraoui (CRIL, Lens, Fr), Antoine Cornuéjols (AgroParisTech, Paris, Fr), Thierry Denoeux (Heudiasyc, Compiègne, Fr), Sébastien Destercke (Heudiasyc, Compiègne, Fr), Didier Dubois (IRIT, Toulouse, Fr), Romain Guillaume (IRIT, Toulouse, Fr), Jérôme Mengin (IRIT, Toulouse, Fr), Henri Prade (IRIT, Toulouse, Fr), Steven Schockaert (School of Computer Science and Informatics, Cardiff, UK), Mathieu Serrurier (IRIT, Toulouse, Fr), Christel Vrain (LIFO, Orléans, Fr).

© Springer Nature Switzerland AG 2019
N. Ben Amor et al. (Eds.): SUM 2019, LNAI 11940, pp. 447–448, 2019.
https://doi.org/10.1007/978-3-030-35514-2
the types of representation used, the roles of knowledge and data, the lack or the excess of information, the need for explanations and causal understanding.

Then some methodologies combining reasoning and learning are reviewed (such as inductive logic programming, neuro-symbolic reasoning, formal concept analysis, rule-based representations and machine learning, uncertainty assessment in prediction, or case-based reasoning and analogical reasoning), before discussing examples of synergies between KRR and ML (including topics such as belief functions on regression, EM algorithm versus revision, the semantic description of vector representations, the combination of deep learning with high-level inference, knowledge graph completion, declarative frameworks for data mining, or preferences and recommendation).

The full paper will be the first step of a work in progress aiming at a better mutual understanding of research in KRR and ML, and how they could cooperate.
Algebraic Approximations for Weighted Model Counting

Wolfgang Gatterbauer

Khoury College of Computer Sciences, Northeastern University, Boston, USA
Abstract. It is a common approach in computer science to approximate a function that is hard to evaluate by a simpler function. Finding such fast approximations is especially important for probabilistic inference, which is widely used, yet notoriously hard. We discuss a recent algebraic approach for approximating the probability of Boolean functions with upper and lower bounds. We give the intuition for these bounds and illustrate their use with three applications: (1) anytime approximations of monotone Boolean formulas, (2) approximate lifted inference with relational databases, and (3) approximate weighted model counting.
1 Probabilistic Inference and Weighted Model Counting

Probabilistic inference over large data sets has become a central data management problem. It is at the core of a wide range of approaches, such as graphical models, statistical relational learning or probabilistic databases. Yet a major drawback of exact probabilistic inference is that it is computationally intractable for most real-world problems. Thus developing general and scalable approximate schemes is a subject of fundamental interest. We focus on weighted model counting, which is a generic inference problem to which all the above approaches can be reduced. It is essentially the same problem as computing the probability of a Boolean formula. Each truth assignment of the Boolean variables corresponds to one model whose weight is the probability of this truth assignment. Weighted model counting then asks for the sum of the weights of all satisfying assignments.
2 Optimal Oblivious Dissociation Bounds

We discuss recently developed deterministic upper and lower bounds for the probability of Boolean functions. The bounds result from treating multiple occurrences of variables as independent and assigning them new individual probabilities, an approach called dissociation. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute. Appropriately executed, these steps can give rise to a novel class of inequalities from which upper and lower bounds can be derived efficiently. In addition, the resulting bounds are oblivious, i.e. they require only limited observations of the structure and parameters of the problem. This technique can yield fast approximate schemes that generate upper and lower bounds for various inference tasks.
3 Talk Outline

We discuss Boolean formulas and their connection to weighted model counting. We introduce dissociation-based bounds and draw the connection to approximate knowledge compilation. We then illustrate the use of dissociation-based bounds with three applications: (1) anytime approximations of monotone Boolean formulas [7], (2) approximate lifted inference with relational databases [4, 5], and (3) approximate weighted model counting [3]. If time remains, we will discuss the similarities and differences to four other techniques that similarly fall into Pearl's classification of extensional approaches to uncertainty [9, Ch. 1.1.4]: (i) relaxation-based methods in logical optimization [8, Ch. 13], (ii) relaxation & compensation for approximate probabilistic inference in graphical models [2], (iii) probabilistic soft logic that uses continuous relaxations in a smart way [1], and (iv) quantization on algebraic decision diagrams [6]. The slides will be made available at https://northeastern-datalab.github.io/afresearch/.
Acknowledgements. This work is supported in part by National Science Foundation grant IIS-1762268. I would also like to thank my various collaborators on the topic: Li Chou, Floris Geerts, Vibhav Gogate, Peter Ivanov, Dan Suciu, Martin Theobald, and Maarten Van den Heuvel.
References

1. Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 18, 109:1–109:67 (2017)
2. Choi, A., Darwiche, A.: Relax then compensate: on max-product belief propagation and more. In: NIPS, pp. 351–359 (2009)
3. Chou, L., Gatterbauer, W., Gogate, V.: Dissociation-based oblivious bounds for weighted model counting. In: UAI (2018)
4. Gatterbauer, W., Suciu, D.: Approximate lifted inference with probabilistic databases. PVLDB 8(5), 629–640 (2015)
5. Gatterbauer, W., Suciu, D.: Dissociation and propagation for approximate lifted inference with standard relational database management systems. VLDB J. 26(1), 5–30 (2017)
6. Gogate, V., Domingos, P.: Approximation by quantization. In: UAI, pp. 247–255 (2011)
7. den Heuvel, M.V., Ivanov, P., Gatterbauer, W., Geerts, F., Theobald, M.: Anytime approximation in probabilistic databases via scaled dissociations. In: SIGMOD, pp. 1295–1312 (2019)
8. Hooker, J.: Logic-Based Methods for Optimization: Combining Optimization and Constraint Satisfaction. John Wiley & Sons (2000)
9. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann (1988)
Author Index

Abo-Khamis, Mahmoud 423
Amel, Kay R. 447
Amor, Nahla Ben 355
Antoine, Violaine 66
Ayachi, Raouia 355
Bartashevich, Palina 310
Benabbou, Nawal 221
Benferhat, Salem 207
Bonifati, Angela 250
Bounhas, Myriam 136, 339
Bouraoui, Zied 207
Bourdache, Nadjet 93
Bouslama, Rihab 355
Chaveroche, Maxime 390
Cherfaoui, Véronique 390
Correia, Alvaro H. C. 409
Couso, Ines 266
Crosscombe, Michael 310
Davoine, Franck 390
de Campos, Cassio P. 409
Denœux, Thierry 368
Destercke, Sébastien 266, 280, 289
Diaz, Amaia Nazabal Ruiz 250
Dubois, Didier 153, 169
Dumbrava, Stefania 250
Dusserre, Gilles 107
Gatterbauer, Wolfgang 449
Gonzales, Christophe 404
Guillot, Pierre-Louis 289
Hachour, Samir 382
Harispe, Sébastien 107
Hüllermeier, Eyke 266
Imoussaten, Abdelhak 107, 122
Jacquin, Lucie 122
Jamshidi, Pooyan 324
Javidian, Mohammad Ali 324
Kawasaki, Tatsuki 79
Kuhlmann, Isabelle 24
L'Héritier, Cécile 107
Labreuche, Christophe 192
Lagrue, Sylvain 280
Lawry, Jonathan 310
Lesot, Marie-Jeanne 433
Lust, Thibaut 221
Martin, Hugo 52
Mercier, David 382
Montmain, Jacky 122
Moriguchi, Sosuke 79
Mutmainah, Siti 382
Ngo, Hung Q. 423
Nguyen, XuanLong 423
Olteanu, Dan 423
Papini, Odile 207
Perny, Patrice 52, 93
Perrin, Didier 122
Pichon, Frédéric 382
Pirlot, Marc 339
Potyka, Nico 236
Prade, Henri 136, 153, 169, 339
Renooij, Silja 38
Roig, Benoît 107
Salhi, Yakoub 184
Schleich, Maximilian 423
Sobrie, Olivier 339
Spanjaard, Olivier 93
Takahashi, Kazuko 79
Thimm, Matthias 1, 9, 24
Tong, Zheng 368
Trousset, François 122
Valtorta, Marco 324
van der Gaag, Linda C. 38
Vuillemot, Romain 250
Wilson, Nic 169
Würbel, Eric 207
Xie, Jiarui 66
Xu, Philippe 368