
A semantic loss for ontology classification

Simon Flügel [0000-0003-3754-9016], Martin Glauer [0000-0001-6772-1943],
Till Mossakowski [0000-0002-8938-5204], and Fabian Neuhaus [0000-0002-1058-3102]

Otto von Guericke University Magdeburg, Germany
{sfluegel, martin.glauer, till.mossakowski, fneuhaus}@ovgu.de

Abstract. Deep learning models are often unaware of the inherent con-
straints of the task they are applied to. However, many downstream tasks
require logical consistency. For ontology classification tasks, such con-
straints include subsumption and disjointness relations between classes.
In order to increase the consistency of deep learning models, we propose
a semantic loss that combines label-based loss with terms penalising
subsumption- or disjointness-violations. Our evaluation on the ChEBI
ontology shows that the semantic loss is able to decrease the number of
consistency violations by several orders of magnitude without decreasing
the classification performance. In addition, we use the semantic loss for
unsupervised learning. We show that this can further improve consistency
on data from a distribution outside the scope of the supervised training.

Keywords: Semantic loss · ontology classification · ChEBI

1 Introduction

Deep learning models have been successfully applied to a wide range of clas-
sification tasks over the past years, often replacing hand-crafted features with
end-to-end feature learning [7,3]. This approach is based on the assumption that
all the knowledge required to solve a specific classification task is available in
the data used. These systems are often built for a specific use case. In the case
of a classification problem, emphasis is placed on the correct classification of
the input data and the success of a system is measured in its ability to cor-
rectly perform this task. However, this approach disregards that there are often
domain-specific logical constraints between different classification targets.
These logical constraints can be of great importance, as applications, espe-
cially those leading to further development, are often based on the assumption
that inputs are logically consistent. Imagine a system consisting of two com-
ponents in an autonomous vehicle. The first component recognises and labels
objects using a deep learning model. Based on this output, a rule-based system
built by experts determines the direction of travel. A contradictory classification
of the first system, e.g. a traffic light as both red and green, or a road user as a
pedestrian and as a car, can have fatal consequences, as the control system may
not cover such a scenario.
It is therefore important to prime systems towards logical consistency. Learn-
ing concepts by example is not optimally suited to adhere to domain-specific
constraints out of the box. Instead, reliance is placed on the fact that the corre-
sponding constraints are represented in the data and that the model can approx-
imate them accordingly during training. However, this approach has significant
disadvantages. Firstly, it assumes that there is a sufficiently large amount of data
so that the corresponding constraints are well represented. Secondly, the system
is deprived of important information that is readily available in the domain.
Thirdly, it creates an additional, implicit learning task that is not adequately
represented by the loss function.
For many research domains, ontologies exist that define important concepts
and their relations via logical constraints [20,4,2]. Ontologies therefore provide
a necessary logical axiomatisation that can be used to check the consistency
of models and to prime them for consistency. For instance, the subsumption
relation A is-a B requires that every entity classified as A is also classified as
B. Usually, this knowledge is not explicitly given to machine learning models
trained on concepts from an ontology. Instead, it can only be derived implicitly
from seeing a large enough number of A samples that are also B samples.
The aim of this paper is to integrate symbolic knowledge from ontologies into
the learning process of a machine learning model. To this end, in Section 3, we
present a semantic loss that extends regular loss functions by additional terms
that ensure the model’s coherence with ontological constraints. In Section 4, we
introduce a classification task on the ChEBI ontology and appropriate evaluation
metrics. These are used in Section 5, where we evaluate different semantic loss
variants. The results are discussed in Section 6 and a conclusion is drawn in
Section 7.

2 Related Work
A well-studied field within machine learning is hierarchical multi-label classi-
fication, in which labels are structured in a hierarchy, similar to the subsumption
relations in an ontology. However, it is usually assumed that each class only
has one superclass, which allows the assignment of hierarchy levels. Many models
use these levels directly in their architecture [27,5]. In ontologies such as ChEBI,
many classes have multiple superclasses, which makes the assignment of hier-
archy levels non-trivial. In addition, ontologies include different kinds of logical
relations between classes, such as disjointness or parthood relations. Therefore,
our task requires a more general approach towards ensuring logical consistency.
Among the approaches that have integrated logical constraints into neural
networks, one of the earliest was “Knowledge-Based Artificial Neural Net-
works” (KBANN, [26]), which attempted to directly represent formulae in propo-
sitional logic within the network structure. During training, the system is able
to adapt these structures to better fit the training data. This allows the priming
of a learning system with prior knowledge.
The training process of neural networks is usually based on a form of gra-
dient descent. Consequently, instead of restricting answers to the truth values {0, 1},
one must allow arbitrary predictions from [0, 1] to remain differentiable.
This naturally leads to an interpretation of these values as values from a many-
valued logic such as fuzzy logic or probabilistic logic. Indeed, there have been
many approaches that aim to combine fuzzy systems and neural networks [22,30].
These systems are particularly useful when training data is limited. In a recent
work [16,19], we applied an ontology-based neuro-fuzzy controller. The approach
in this paper is inspired by this work, in which we also apply a semantic penalty
system to ensure logically sound rules.
DeepProbLog [23] follows a probabilistic interpretation of prediction val-
ues. This approach is based on the probabilistic logic programming framework
ProbLog [9]. ProbLog allows the expression of Prolog-like inference rules with ad-
ditional uncertainty annotations, e.g. 0.3::p(X) :- q(X), r(X). The formulation
of these rules does, however, require extensive expert knowledge or data in or-
der to derive the appropriate annotations. DeepProbLog extends this framework
by allowing uncertainty annotations to be derived from a neural network. Logic
Tensor Networks (LTNs,[1]) train neural predicates to maximise satisfiability of
a background theory, which is a form of semantic loss.
Neural networks are, in particular during training, prone to making mis-
takes that may result in logically inconsistent predictions. An image recognition
system may, for example, classify the same picture as a cat and a dog. In com-
bination with logic approaches, these mistakes may cause severe side effects for
other systems that expect consistent input [11]. In most classical logics, once
an inconsistency has been derived, any statement is entailed. This strong effect
of inconsistencies is not desirable in applications that must allow for some level
of inconsistency - in particular if human input is used. If a person makes an
inconsistent statement in their tax form, a possible neuro-symbolic tax system
should not be able to infer that Elvis is the king of Sweden or other arbitrary
facts from that - the inconsistency should be kept local. Logical Neural Networks
(LNNs, [25]) allow for some local inconsistencies in their reasoning process. This
kind of network is designed to directly represent the structure of a logical theory
with upper and lower bounds instead of truth values. During inference, these
systems also use a semantic penalty term that trains the system to avoid logical
inconsistencies.
Xu et al. propose a more general definition of a semantic loss for arbitrary
logical sentences [29]. Neural network outputs in a multi-class classification task
are interpreted as probabilities, leading to a probability that can be assigned to
each state in which a given logical sentence is either satisfied or not. The loss
function is then defined as the negative logarithm of the sum of probabilities for
each variable assignment satisfying the sentence φ in question:
\[
L_{Xu}(\varphi, p) \propto -\log \sum_{x \models \varphi} \; \prod_{i:\, x \models X_i} p_i \prod_{i:\, x \models \neg X_i} (1 - p_i) \tag{1}
\]

Models in this loss definition are binary. Therefore, a given model satisfies A ⊑ B
if and only if x |= ¬A or x |= B. For an implication A ⊑ B and a prediction
vector p, the loss is calculated as
\[
\begin{aligned}
L_{Xu}(A \sqsubseteq B, p) &\propto -\log \sum_{x \models A \sqsubseteq B} \; \prod_{i:\, x \models X_i} p_i \prod_{i:\, x \models \neg X_i} (1 - p_i) \\
&= -\log\big((1 - p_a) \cdot (1 - p_b) + (1 - p_a) \cdot p_b + p_a \cdot p_b\big) \\
&= -\log(p_a \cdot p_b - p_a + 1) \\
&= -\log\big(1 - p_a \cdot (1 - p_b)\big)
\end{aligned} \tag{2}
\]
The loss definition of Xu et al. is similar to the one used in this work, although we
derive our loss from fuzzy logic instead of probabilistic logic. When using the product
t-norm, our loss for an implication is defined as

Lprod (A ⊑ B, p) = pa · (1 − pb ) (3)

The additional negative logarithm is introduced by Xu et al. to achieve a closer
correspondence to cross-entropy loss functions, while we use the result of the
logical evaluation directly. For comparison, we include the semantic loss defined
by Xu et al. in our evaluation in Section 5.
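To make the difference concrete, the following minimal Python sketch evaluates both loss terms for a single implication A ⊑ B; the function names and example values are illustrative only.

```python
import math

def loss_xu(p_a: float, p_b: float) -> float:
    # Xu et al.'s semantic loss for A ⊑ B: -log(1 - p_a * (1 - p_b)), cf. Eq. 2
    return -math.log(1.0 - p_a * (1.0 - p_b))

def loss_prod(p_a: float, p_b: float) -> float:
    # Product t-norm loss for A ⊑ B: p_a * (1 - p_b), cf. Eq. 3
    return p_a * (1.0 - p_b)

# A consistent prediction yields a small loss for both variants,
# a strong violation (A predicted, B not predicted) a large one.
for p_a, p_b in [(0.9, 0.95), (0.9, 0.1)]:
    print(p_a, p_b, round(loss_xu(p_a, p_b), 3), round(loss_prod(p_a, p_b), 3))
```

Near the maximal violation (p_a → 1, p_b → 0), the logarithm lets L_Xu grow without bound, while L_prod is capped at 1.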

3 Semantic loss

Predictions made by a neural network may contradict a logical theory that
underlies the predicted labels. In this work, we aim to incentivise a model to
produce logically more consistent predictions by adding an additional term to
its loss function. While many types of ontology axioms exist, here, we focus on
two types that are widely used and domain independent: subsumption relations,
i.e., A ⊑ B, and disjointness, i.e., C ⊓ D ≡ ⊥. While these axioms are usually
interpreted in binary semantics, we need differentiable terms that can be used
for training a neural network. We achieve differentiability by applying a fuzzy-
logic interpretation [17] to our output values. Let h_c : X → [0, 1] be a fuzzy
membership function for a given class c and T a fuzzy t-norm. Our semantic loss
term for implications is then defined as

LT (A ⊑ B, x) := ĥ(¬(A → B), x) = ĥ(A ∧ ¬B, x) = T (hA (x), 1 − hB (x)). (4)

This assumes that the fuzzy negation used is a strong negation N (x) = 1 − x.
Accordingly, the semantic loss term for disjointness is defined as

LT (C ⊓ D ≡ ⊥, x) := ĥ(¬¬(C ∧ D), x) = ĥ(C ∧ D, x) = T (hC (x), hD (x)). (5)

Intuitively, the semantic loss can be interpreted as the degree to which a given
prediction violates an ontological constraint.
In Section 5, we evaluate loss functions derived from two commonly used
t-norms, the product t-norm Tprod (a, b) = a · b and the Łukasiewicz t-norm
Tluka (a, b) = max(a + b − 1, 0).
Let x be a vector of length m of sample vectors, y the vector of m corre-
sponding label vectors and ŷ = [h_A(x)]_{A∈O} the vector of predicted labels. Based
on the loss terms given in Eqs 4 and 5, we define our loss function as follows:

\[
L_T(x, y) = L_{base}(y, \hat{y})
+ \frac{w_{impl}}{m} \sum_{i=1}^{m} \sum_{A \sqsubseteq B} T\big((\hat{y}_i)_A,\, 1 - (\hat{y}_i)_B\big)
+ \frac{w_{disj}}{m} \sum_{i=1}^{m} \sum_{C \sqcap D \equiv \bot} T\big((\hat{y}_i)_C,\, (\hat{y}_i)_D\big) \tag{6}
\]
The Lbase term refers to the supervised loss used to train the model on the classi-
fication task. The weights wimpl and wdisj are intended to adjust the importance
of the semantic loss terms in relation to the base loss and to compensate for the
different prevalences of the axiom types in the ontology. In general, we expect the
number of subsumption and disjointness relations in an ontology to vary based
on the hierarchy depth and number of disjointness axioms available. Therefore,
these weights have to be adjusted based on the task at hand.
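For illustration, a PyTorch-style sketch of Eq. 6 with the product t-norm is given below; the tensor layout (one column per label, axiom pairs as index tensors) and all names are our own assumptions rather than the interface of the released implementation.

```python
import torch

def semantic_loss_prod(y_hat, y, impl_pairs, disj_pairs,
                       base_loss_fn, w_impl=0.01, w_disj=100.0):
    """Sketch of Eq. 6 with the product t-norm T(a, b) = a * b.

    y_hat:      (m, n_labels) predicted class memberships in [0, 1]
    y:          (m, n_labels) binary labels
    impl_pairs: (n_impl, 2) long tensor of label indices (A, B) with A ⊑ B
    disj_pairs: (n_disj, 2) long tensor of label indices (C, D) with C ⊓ D ≡ ⊥
    """
    base = base_loss_fn(y_hat, y)
    # implication terms: T(h_A(x), 1 - h_B(x)) = h_A(x) * (1 - h_B(x))
    impl = (y_hat[:, impl_pairs[:, 0]] * (1 - y_hat[:, impl_pairs[:, 1]])).sum(dim=1)
    # disjointness terms: T(h_C(x), h_D(x)) = h_C(x) * h_D(x)
    disj = (y_hat[:, disj_pairs[:, 0]] * y_hat[:, disj_pairs[:, 1]]).sum(dim=1)
    return base + w_impl * impl.mean() + w_disj * disj.mean()
```

The default weights correspond to the values used in Section 4.2; swapping in the Łukasiewicz t-norm only changes the two product expressions to torch.clamp(a + b - 1, min=0).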

3.1 Balanced implication loss


The loss terms for implication face an imbalance issue: Since the classes on the
left-hand side of each implication are subclasses of the right-hand side classes,
they necessarily have fewer members in the ontology and therefore fewer labels
in a given dataset. Since we include transitive subsumption relations as well,
the difference may be drastic, with some left-hand side classes representing only
a small fraction of the right-hand side class. Therefore, in case of violations, it
might be relatively inexpensive for the model to disregard classes further down
in the hierarchy entirely. This strategy results in a low number of implication
violations, since such classes appear mostly on the left-hand side of implications,
and a low supervised loss, due to the lack of positive samples.
However, this behaviour is clearly not in our interest. To counter-balance this
effect, the balanced implication loss has a lower gradient for the left-hand class
and a higher gradient for the right-hand class instead of applying the same gra-
dient to both classes. Practically, this is achieved with two additional parameters
k > 1 and ϵ > 0:
\[
L^B_T(A \sqsubseteq B, x; k, \epsilon) = T\!\left( \frac{(h_A(x) + \epsilon)^{1/k} - \epsilon^{1/k}}{(1 + \epsilon)^{1/k} - \epsilon^{1/k}},\; (1 - h_B(x))^k \right) \tag{7}
\]
ϵ is a small constant that is added to h_A(x) to avoid an infinite gradient at
h_A(x) = 0. The additional ϵ-terms adjust the loss so that L^B_T = 0 if h_A(x) = 0
and L^B_T = 1 if h_A(x) = 1 and h_B(x) = 0. In our evaluation, we will use ϵ = 0.01.
The parameter k modifies the loss term such that, in the maximal violation case
of h_A(x) = 1 and h_B(x) = 0, the gradient is larger for h_B than for h_A. For
instance, for the product t-norm, ∂L^B_prod/∂h_A(x) ≈ 1/k (up to the small
ϵ-correction) and ∂L^B_prod/∂h_B(x) = −k at this point.
The regular implication loss can be seen as a specialised version of this bal-
anced implication loss with k = 1 (in which case the value of ϵ is irrelevant).
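A sketch of the balanced term from Eq. 7 for the product t-norm, with the ϵ = 0.01 used in our evaluation as default; the function name and signature are illustrative.

```python
import torch

def balanced_impl_loss_prod(h_a, h_b, k=2.0, eps=0.01):
    """Balanced implication loss (Eq. 7) with the product t-norm.

    h_a, h_b: predicted memberships of subclass A and superclass B, in [0, 1].
    k > 1 shifts gradient mass from h_a towards h_b; eps avoids an
    infinite gradient at h_a = 0.
    """
    lhs = ((h_a + eps) ** (1.0 / k) - eps ** (1.0 / k)) / \
          ((1.0 + eps) ** (1.0 / k) - eps ** (1.0 / k))
    rhs = (1.0 - h_b) ** k
    return lhs * rhs  # T_prod(a, b) = a * b
```

For k = 1, this reduces to the unbalanced term h_A(x) · (1 − h_B(x)).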

4 Experimental setup
We evaluate the semantic loss for a classification task in the ChEBI ontology.
This task has been studied in our previous work and a deep learning-based ap-
proach for the ChEBI classification task has been developed [18,12,13]. In all
evaluations, we train an ELECTRA model [6] for a hierarchical multi-label clas-
sification task in which ChEBI classes act as labels and molecules as instances.
For a detailed description of the approach, we refer to [14]. Here, we just provide
an overview. The source code for our implementation is available on GitHub at
https://github.com/ChEB-AI/python-chebai.
4.1 Datasets

Our setup draws data from two sources. Labelled data is taken from the ChEBI
ontology [10,20], while additional unlabelled data is sourced from the PubChem
database [21]. All datasets are available on Kaggle at
https://www.kaggle.com/datasets/sfluegel/chebai-semantic-loss.
In all datasets, we use the SMILES (Simplified Molecular Input Line Entry
System) [28], a common string representation for chemical structures. It encodes
molecules as sequences in which characters represent atoms and bonds, with
additional notation for branches, rings and stereoisomerism.
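For example, the SMILES string CC(=O)Oc1ccccc1C(=O)O denotes acetylsalicylic acid (aspirin): parentheses mark branches, the digits mark ring closures, and lowercase atom symbols indicate an aromatic ring.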
For the labelled data, we use version 231 of ChEBI, which contains 185 thou-
sand SMILES-annotated classes. Out of these classes, we form the ChEBI100
dataset by attaching all superclasses as labels which have at least 100 SMILES-
annotated subclasses. The transitive closure of subsumption relations between
the labels is used for the semantic loss. Disjointness axioms for ChEBI are provided
by an additional ontology module (https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi-disjoints.owl).
Here as well, we take the transitive closure of all disjointness relations between
label-classes. In total, this provides
us with 997 labels, 19,308 implication loss terms and 31,416 disjointness loss
terms.
From PubChem, we have sourced two distinct datasets. The first is used
during training while the second one, PubChem Hazardous, is only used in the
evaluation. The Hazardous dataset includes SMILES strings for chemicals that
are annotated with a class from the Globally Harmonized System of Classifi-
cation and Labelling of Chemicals (GHS) [24]. The GHS covers different kinds
of health, physical and environmental hazards and has been developed by the
United Nations as a standard for labelling hazardous chemicals and providing
related safety instructions. From this, we have removed all SMILES strings that
also appear in the labelled dataset. In our evaluation, we use this dataset to test
model performance for a data distribution outside the learning distribution.
For the training dataset, we have randomly selected 1 million SMILES strings
from PubChem. This set has been split into groups of 10,000. From each group,
we have selected the 2,000 SMILES strings with the lowest similarity score,
resulting in 200,000 instances. The similarity score used is the sum of pairwise
Tanimoto similarities between the RDKit fingerprints of a given SMILES string
and all other SMILES strings in the group. This way, we ensure with limited
computational expense that our dataset covers a broad range of chemicals.
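The selection step can be sketched with RDKit as follows; this is an illustration of the procedure under our own naming assumptions, not the exact script used to build the dataset.

```python
from rdkit import Chem, DataStructs

def select_diverse(smiles_group, n_keep=2000):
    """Keep the n_keep SMILES with the lowest summed pairwise Tanimoto similarity."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_group]
    valid = [s for s, m in zip(smiles_group, mols) if m is not None]
    fps = [Chem.RDKFingerprint(m) for m in mols if m is not None]
    # summed Tanimoto similarity of each fingerprint to all fingerprints in the group
    # (the self-similarity of 1 is a constant offset and does not affect the ranking)
    scores = [sum(DataStructs.BulkTanimotoSimilarity(fp, fps)) for fp in fps]
    ranked = sorted(zip(scores, valid))  # lowest total similarity = most diverse
    return [s for _, s in ranked[:n_keep]]
```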
The training set is used for two tasks: firstly, a pretraining step prior to the
training on the ChEBI100 dataset, which is shared by all models in our evaluation;
and secondly, a semi-supervised training, in which we train a model
simultaneously on labelled and unlabelled data. In our evaluation, we will com-
pare semi-supervised training to training on only labelled data.

4.2 Loss function

In order to apply the semantic loss function from Eq. 6, we need to choose a
classification loss Lbase and assign the weights wimpl and wdisj .
For the classification loss, we have chosen a weighted binary cross-entropy
loss:
\[
L_{base}(x, y) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{c \in C} w_c\, (y_i)_c \cdot \log h_c(x_i) + (1 - (y_i)_c) \cdot \log(1 - h_c(x_i)) \tag{8}
\]

Here, w_c is a weight assigned to positive entries based on the class c. These
weights are used to increase the importance of classes with fewer members in an
imbalanced dataset. We apply the scheme introduced by [8] with β = 0.99 and
normalize the weights:

\[
w_c = \frac{w'_c \cdot |C|}{\sum_{c' \in C} w'_{c'}} \quad \text{where} \quad w'_c = \frac{1 - \beta}{1 - \beta^{|c|}} \tag{9}
\]

For the semantic loss terms, although the number of disjointness terms (31,416)
is larger than the number of implication terms (19,308), we have chosen the
weights wimpl = 0.01 and wdisj = 100. This is motivated by preliminary ex-
periments in which the implication loss was larger than the disjointness loss by
several orders of magnitude.
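A small sketch of this weighting scheme (Eqs. 8 and 9), assuming the per-class membership counts |c| are available as a tensor; the function name is illustrative.

```python
import torch

def class_balanced_weights(class_counts, beta=0.99):
    """Class-balanced weights following Cui et al. (Eq. 9),
    normalised so that the weights average to 1 over all classes."""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    w_prime = (1.0 - beta) / (1.0 - beta ** counts)  # w'_c = (1 - β) / (1 - β^|c|)
    return w_prime * len(counts) / w_prime.sum()     # w_c = w'_c · |C| / Σ w'_c
```

With β = 0.99, a class with 10 members receives roughly ten times the weight of a class with 1,000 members.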

4.3 Violation metrics


In order to quantify the consistency of model predictions with the ontology, we
introduce a notion of true positives (TPs) and false negatives (FNs) for con-
sistency violations. In this context, all pairs of ChEBI100 -labels are considered
as violation-labels. These labels are positive if an explicit subsumption / dis-
jointness relation between both classes exists in ChEBI. Individual predictions
are converted into truth values according to a threshold of 0.5 and the resulting
truth values are compared against the label-pairs.
Given a sample x, we define the number of TPs as

#TP_impl(x) = |{(A, B) : A ⊑ B ∧ h_A(x) > 0.5 ∧ h_B(x) > 0.5}|   (10)

and the number of FNs as

#FN_impl(x) = |{(A, B) : A ⊑ B ∧ h_A(x) > 0.5 ∧ h_B(x) ≤ 0.5}|.   (11)

For disjointness, the definition is analogous:

#TP_disj(x) = |{(C, D) : C ⊓ D ≡ ⊥ ∧ h_C(x) > 0.5 ∧ h_D(x) ≤ 0.5}|   (12)

#FN_disj(x) = |{(C, D) : C ⊓ D ≡ ⊥ ∧ h_C(x) > 0.5 ∧ h_D(x) > 0.5}|.   (13)
This definition does not take the cases h_A(x) ≤ 0.5 or h_C(x) ≤ 0.5 into
account, which could be considered as true positives as well since they do not
contradict the ontology axioms. However, since these cases do not require an
active prediction, we consider them as "consistent by default" and as less relevant
for our evaluation. Note that, although disjointness axioms are symmetric, this
non-symmetric metric requires that we consider both "directions" of the axiom:
If C ⊓ D ≡ ⊥, h_C(x) ≤ 0.5 and h_D(x) > 0.5, we do not count (C, D) as a
true positive, but instead count (D, C). This also means that #FN_disj(x) will
necessarily be even since for every false negative (C, D), there is another false
negative (D, C).

Vocabulary size      1,400
Hidden size          256
# attention heads    8
# hidden layers      6
# max. epochs        200
Learning rate        1e-3
Optimizer            Adamax
w_impl               0.01
w_disj               100
β                    0.99
Table 1: Hyperparameters used during training
Given the numbers of TPs and FNs, we use the false negative rate (FNR),
defined as
\[
FNR_t(x) = \frac{\#FN_t(x)}{\#FN_t(x) + \#TP_t(x)}, \tag{14}
\]
in our evaluation, t being either impl or disj.
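Given thresholded predictions and the same pair-index representation as in the loss sketch above, the implication FNR can be computed as in the following sketch (names are illustrative):

```python
import torch

def implication_fnr(y_hat, impl_pairs, threshold=0.5):
    """FNR for implication violations, following Eqs. 10, 11 and 14."""
    pred = y_hat > threshold                 # (m, n_labels) boolean predictions
    left = pred[:, impl_pairs[:, 0]]         # subclass A predicted
    right = pred[:, impl_pairs[:, 1]]        # superclass B predicted
    tp = (left & right).sum().item()         # A and B predicted: consistent
    fn = (left & ~right).sum().item()        # A predicted, B not: violation
    return fn / (fn + tp) if (fn + tp) > 0 else 0.0
```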

5 Results

We evaluate four configurations of the semantic loss, one using the Łukasiewicz
t-norm T_Luka(a, b) = max(0, a + b − 1) and three using the product t-norm
T_prod(a, b) = a · b. For the product t-norm, we include, besides the "standard"
variant, one which uses the balanced implication loss L^B described in Section 3.1
with k = 2 and the semi-supervised variant trained on a mixed ChEBI100 and
PubChem dataset (see Section 4.1). For comparison, we also include a con-
figuration using the semantic loss described by Xu et al. [29] and a baseline
configuration trained without semantic loss.
We have conducted a pretraining run on our PubChem dataset and sub-
sequently, 3 fine-tuning runs for each variant. In the following, we will only
report averages over the 3 runs; the results for individual runs can be found in
Appendix A.
The hyperparameters shared by all models are given in Table 1. We have split
the ChEBI100 dataset into a training, validation and test set with a 340/9/51
ratio. The evaluation has been conducted on the test set using the models with
the highest micro-F1 score from each training run.
Table 2 and Figure 1a show the false negative rate (FNR) for implication and
disjointness violations on the ChEBI100 and the PubChem Hazardous datasets.
It can be seen that all models outperform the baseline by a margin of about
two orders of magnitude. In absolute terms, this corresponds to 13 × 10³ false
negatives for the baseline and between 81 FNs (for T_Luka) and 247 FNs (T_prod
with mixed data) for the semantic loss models. The number of true positives is
similar for all models (between 3.85 × 10⁶ and 4.29 × 10⁶).

                     ChEBI100                      PubChem Hazardous
baseline             0.0031 ± 0.0002               0.0067 ± 0.005
T_Luka               2.09 × 10⁻⁵ ± 3.3 × 10⁻⁶      1.49 × 10⁻⁵ ± 2.6 × 10⁻⁵
T_prod               3.18 × 10⁻⁵ ± 1.2 × 10⁻⁵      5.54 × 10⁻⁵ ± 4.6 × 10⁻⁵
T_prod (k=2)         3.74 × 10⁻⁵ ± 1.5 × 10⁻⁵      7.29 × 10⁻⁵ ± 8.8 × 10⁻⁵
T_prod (mixed data)  5.91 × 10⁻⁵ ± 3.2 × 10⁻⁵      1.05 × 10⁻⁵ ± 9.8 × 10⁻⁶
Xu et al.            3.62 × 10⁻⁵ ± 1.2 × 10⁻⁵      3.93 × 10⁻⁵ ± 2.8 × 10⁻⁵

Table 2: Average FNR for binary implication violations on the ChEBI100 and
PubChem Hazardous datasets. The FNR has been calculated separately for each
run before averaging. In addition, the table shows the standard deviation between
the runs.

Fig. 1: (a) FNR for implication violations on the ChEBI100 (left) and PubChem
Hazardous datasets; the standard deviation is indicated by black dashes along each
bar. (b) F1-scores on the ChEBI100 dataset, aggregated on the micro- (left) and
macro-level; the black bars indicate the standard deviation between the 3 runs.
For the Łukasiewicz t-norm models, we observe the lowest FNR on ChEBI100 .
The models trained with a product t-norm based loss and the semantic loss of
Xu et al. have slightly higher FNRs on ChEBI100 . Regarding the PubChem
Hazardous dataset, it is remarkable that, while most models have a similar or
slightly higher FNR compared to ChEBI100 , the models trained with additional
PubChem data perform significantly better on PubChem Hazardous.
Regarding the disjointness violations, we have not observed any violations
for any model except the baseline models. There, we have averages of 171 FNs
on ChEBI100 and 4 FNs on PubChem Hazardous. While these numbers are far
below the numbers of TPs (1.07 × 10⁷ for ChEBI100 and 1.12 × 10⁸ for PubChem
Hazardous), they show that the semantic loss had a consistency-improving effect.
For all semantic loss variants, the models were able to produce inconsistency-free
results regarding disjointness.
In addition, we also evaluated the predictive performance of all configura-
tions. As can be seen in Table 3 and Figure 1b, the F1-score for the models
that were trained with semantic loss, with the exception of the balanced version,
is slightly lower than for the baseline models. This is particularly true for the
models that were trained on mixed data or with the Łukasiewicz loss.

                     Micro-F1           Macro-F1
baseline             0.913 ± 0.004      0.653 ± 0.02
T_Luka               0.869 ± 0.003      0.395 ± 0.02
T_prod               0.907 ± 0.0003     0.593 ± 0.009
T_prod (k=2)         0.913 ± 0.004      0.643 ± 0.04
T_prod (mixed data)  0.898 ± 0.002      0.501 ± 0.03
Xu et al.            0.906 ± 0.002      0.579 ± 0.02

Table 3: F1-scores calculated on labelled data. The micro-F1 aggregates predictions
over all classes before calculating the score, while the macro-F1 is the average of
the class-wise scores.
The lower performance of the Łukasiewicz models is linked to an unsuccess-
ful training. While the performance of all other models continuously increased
during training and converged to the level reported here near the end of the allotted
200 epochs, for all 3 Łukasiewicz models, the performance started to drop at ap-
proximately 50 epochs into the training. Further analysis suggests that the drop
during training has been caused by exploding gradients. Here, we report the re-
sults for the best-performing models near the 50 epoch-mark. At that point, the
performance of the other semantic loss models was similar to the Łukasiewicz
models.

6 Discussion

Our results indicate that the introduction of a semantic loss during training in-
creases the overall logical consistency of predictions significantly. However, since
the number of consistency violations is relatively low even for the baseline model,
one might consider an a posteriori processing step that transforms the model out-
put into consistent predictions (e.g., by setting all but the highest output value
to 0 for disjointness axioms). This can be expected to lead to little or no loss in pre-
dictive performance since most predictions are already non-violating and some
corrections might even turn wrong predictions into correct ones. However, we
have only considered the unprocessed predictions in our evaluation. We justify
this by the intended use cases: A model trained on the classification task may
further be used in downstream tasks (e.g., prediction of chemical properties [15]).
Those require that the model has actually learned the ontology’s structure. This
can only be achieved by giving the model direct feedback during training (as we
did with the semantic loss) instead of superficially correcting the results.
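For reference, such an a posteriori correction for disjointness axioms could look like the following hypothetical sketch; it is not part of our evaluated pipeline.

```python
import torch

def enforce_disjointness(y_hat, disj_pairs, threshold=0.5):
    """For each violated disjointness pair, set the lower of the two outputs to 0."""
    fixed = y_hat.clone()
    for c, d in disj_pairs.tolist():
        violated = (fixed[:, c] > threshold) & (fixed[:, d] > threshold)
        c_weaker = violated & (fixed[:, c] <= fixed[:, d])
        d_weaker = violated & (fixed[:, d] < fixed[:, c])
        fixed[c_weaker, c] = 0.0
        fixed[d_weaker, d] = 0.0
    return fixed
```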
For most semantic loss variants, their increase of consistency comes to the
detriment of the actual predictive quality. This result seems contradictory at
first, as one would assume more consistent results to be better overall.
One possible explanation lies in the imbalanced character of ontology-based
datasets. The hierarchical relations between labels create a dataset in which a
significant imbalance is inevitable and cannot be overcome by sampling procedures
in any significant way. A class will always have fewer members than its
parents as long as there is an ontological distinction between them that is also
represented in the dataset. Consequently, classes that reside further down in
the hierarchy are often significantly smaller than those higher up. For ChEBI
in particular, more specialised classes also require the model to learn relatively
complex patterns from a limited amount of samples. The corresponding labels
receive a relatively small training signal that is then counteracted by the additional
loss due to violations. This may render the model unable to learn some
smaller classes.

Fig. 2: Value of the semantic loss variants L_prod, L_Luka, L_Xu and L^B_prod with
k = 2 for a subsumption relation A ⊑ B with different values of h_A(x) and
h_B(x). L_Xu has been cut off at L_Xu = 1 since lim_{p_a → 1, p_b → 0} L_Xu(A ⊑ B, p) = ∞.
This is supported by the differences between micro- and macro-F1. For all
models, the macro-F1 is far lower than the micro-F1. This means that many small
classes that contribute little to the micro-F1, but receive a stronger weighting in
the macro-F1, perform badly. For the semantic loss variants, with the exception
of the balanced semantic loss, this gap widens (from 26% for the baseline to 31%
for Tprod or 47% for Tluka ). This shows that, when the predictive performance
decreases, it mostly affects classes with fewer members.
However, a general tendency to make fewer predictions cannot be observed.
While all models make a similar number of predictions on the ChEBI100 dataset,
the semantic loss models make more predictions on average for the PubChem
dataset (9.4 with product loss, 8.1 for the baseline). This shows that the lower
F1-score in the final models is not due to a generally more "cautious" behaviour.
Instead, only some classes may get left out while additional (consistent, but
wrong) predictions are made for other classes. Also, the lower performance may
be attributed to differences in the learning process, e.g., a less explorative model
behaviour or a slower convergence.
In addition, the results confirm our hypothesis regarding the balanced se-
mantic loss. Without losing consistency compared to the unbalanced variant, it
is able to reach a predictive performance similar to the baseline. Recall that the
main difference between the balanced and unbalanced semantic loss is the gradi-
ent in cases where for a subsumption relation A ⊑ B, the model predicts hA (x)
close to 1 and hB (x) close to 0 (cf. Figure 2). There, without the balancing, both
classes get the same gradient. With balancing, the gradient is stronger for hB (x)
than for hA (x) (in our experiment, by a factor of approximately 4).
The balanced semantic loss has been successful in pushing the model to-
wards more consistent predictions without pushing it towards predictions that
contradict the labels of the classification task.
The results also indicate that the inclusion of unlabelled data into the train-
ing process does hedge the system against inconsistencies on unseen data. This
result can be particularly useful in scenarios in which the distribution of features
in the dataset is limited. Deep learning systems are prone to suffer from out-
of-distribution errors, e.g., unpredictable behaviour on data that has not been
sampled from the same distribution as the training data. Lifting this limitation
is often not easy because additional labelling is required. The semi-supervised
training method presented here can help to alleviate this problem.

7 Conclusion
In this work, we have introduced a semantic loss function for the task of on-
tology classification. Our semantic loss is based on a fuzzy logic interpretation
of the ontology subsumption and disjointness relations. To counteract the loss
function’s tendency to disincentivise predictions of low-level ontology classes,
we have proposed a balanced semantic loss variant as well. In our evaluation,
we have compared different versions of our semantic loss (based on either the
Łukasiewicz t-norm or the product t-norm) to a baseline model and the semantic
loss function proposed by [29]. We have shown that all semantic loss variants
were able to reduce the number of consistency violations by approximately two
orders of magnitude.
Regarding performance on the classification task, we have seen greater differ-
ences between the loss functions. Most variants have both a slightly lower micro-
and a significantly lower macro-F1 than the baseline (especially the Łukasiewicz-based
variant). This indicates that especially the predictive performance of small classes
is affected by the semantic loss. Only the balanced semantic loss was able to per-
form on a par with our baseline.
In addition, we have used the semantic loss for an additional training task
on unlabelled data. This allows us to generalise beyond the original data distri-
bution used for supervised training. Our evaluation on the Hazardous subset of
PubChem shows that this form of training can further improve the consistency
of predictions for out-of-distribution data.
Future work will include an improved normalisation to avoid performance
issues like we reported for the Łukasiewicz semantic loss. Also, it is possible
to extend our approach to other types of ontology axioms, e.g., parthood rela-
tions. Finally, it would be interesting to incorporate our finding of a balanced
implication loss into more general frameworks like LTNs or LNNs.

Acknowledgements
This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, Ger-
man Research Foundation) - 522907718.
A Results for individual runs
In Section 5, we have presented results as the average and standard deviation
over 3 runs for every configuration.

                    Dataset    Run 1         Run 2         Run 3
Baseline            ChEBI100   0.0034        0.0031        0.0029
                    Hazardous  0.0066        0.012         0.0015
T_Luka              ChEBI100   2.32 × 10⁻⁵   1.72 × 10⁻⁵   2.23 × 10⁻⁵
                    Hazardous  4.45 × 10⁻⁵   0             1.79 × 10⁻⁷
T_prod              ChEBI100   1.94 × 10⁻⁵   3.36 × 10⁻⁵   4.24 × 10⁻⁵
                    Hazardous  6.26 × 10⁻⁵   5.72 × 10⁻⁶   9.78 × 10⁻⁵
T_prod (k=2)        ChEBI100   2.41 × 10⁻⁵   3.46 × 10⁻⁵   5.35 × 10⁻⁵
                    Hazardous  2.22 × 10⁻⁵   0.00017       2.25 × 10⁻⁵
Xu et al.           ChEBI100   4.44 × 10⁻⁵   2.30 × 10⁻⁵   4.12 × 10⁻⁵
                    Hazardous  6.68 × 10⁻⁶   5.58 × 10⁻⁵   5.56 × 10⁻⁵
T_prod (mixed data) ChEBI100   4.78 × 10⁻⁵   3.46 × 10⁻⁵   9.48 × 10⁻⁵
                    Hazardous  2.98 × 10⁻⁷   1.98 × 10⁻⁵   1.14 × 10⁻⁵

Table 4: FNRs for implication violations on the ChEBI100 and PubChem
Hazardous datasets.

                    Aggregation  Run 1   Run 2   Run 3
Baseline            micro        0.909   0.915   0.915
                    macro        0.625   0.669   0.665
T_Luka              micro        0.870   0.866   0.873
                    macro        0.399   0.369   0.415
T_prod              micro        0.907   0.907   0.907
                    macro        0.584   0.600   0.597
T_prod (k=2)        micro        0.915   0.916   0.908
                    macro        0.661   0.666   0.601
Xu et al.           micro        0.904   0.908   0.905
                    macro        0.570   0.597   0.569
T_prod (mixed data) micro        0.895   0.899   0.899
                    macro        0.471   0.502   0.529

Table 5: F1-scores with micro and macro aggregation.

Tables 4 and 5 show the FNRs and F1-scores for the individual runs. For
the FNR, it can be observed that the results vary significantly between runs,
especially on the Hazardous dataset. This is likely due to the small scale we
are considering: on ChEBI100, an FNR of 2.5 × 10⁻⁵ roughly corresponds
to about 100 observed false negatives over the whole test set. That is, out of 19
thousand samples, each of which has 19 thousand subsumption relations that
could have resulted in a false negative, only 100 actually are false negatives.
Therefore, slight changes in the predictive performance can have a significant
impact on the false negative rate.
The F1-scores are more stable overall, with a range of less than one percent
for the micro aggregation and up to six percent for the macro aggregation.

References
1. Badreddine, S., d’Avila Garcez, A.S., Serafini, L., Spranger, M.: Logic tensor net-
works. Artif. Intell. 303, 103649 (2022). https://doi.org/10.1016/J.ARTINT.
2021.103649, https://doi.org/10.1016/j.artint.2021.103649
2. Bayerlein, B., Schilling, M., Birkholz, H., Jung, M., Waitelonis, J., Mädler, L., Sack,
H.: Pmd core ontology: Achieving semantic interoperability in materials science.
Materials & Design 237, 112603 (2024)
3. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P.,
Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for
self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
4. Booshehri, M., Emele, L., Flügel, S., Förster, H., Frey, J., Frey, U., Glauer, M.,
Hastings, J., Hofmann, C., Hoyer-Klick, C., et al.: Introducing the open energy
ontology: Enhancing data interpretation and interfacing in energy systems analysis.
Energy and AI 5, 100074 (2021)
5. Cerri, R., Barros, R.C., De Carvalho, A.C.: Hierarchical multi-label classification
using local neural networks. Journal of Computer and System Sciences 80(1), 39–
56 (2014)
6. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: Electra: Pre-training text en-
coders as discriminators rather than generators. arXiv preprint arXiv:2003.10555
(2020)
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. Journal of machine learning
research 12, 2493–2537 (2011)
8. Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on
effective number of samples. In: Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition. pp. 9268–9277 (2019)
9. De Raedt, L., Kimmig, A., Toivonen, H.: Problog: A probabilistic prolog and its
application in link discovery. In: IJCAI 2007, Proceedings of the 20th international
joint conference on artificial intelligence. pp. 2462–2467. IJCAI-INT JOINT CONF
ARTIF INTELL (2007)
10. Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught,
A., Alcántara, R., Darsow, M., Guedj, M., Ashburner, M.: ChEBI: a database
and ontology for chemical entities of biological interest. Nucleic Acids Research
36(Database issue), D344–D350 (Jan 2008). https://doi.org/10.1093/nar/
gkm791, http://dx.doi.org/10.1093/nar/gkm791, publisher: European Bioinfor-
matics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
11. Giunchiglia, E., Stoian, M.C., Khan, S., Cuzzolin, F., Lukasiewicz, T.: Road-r: The
autonomous driving dataset with logical requirements. Machine Learning 112(9),
3261–3291 (2023)
12. Glauer, M., Memariani, A., Neuhaus, F., Mossakowski, T., Hastings, J.: Inter-
pretable ontology extension in chemistry. Semantic Web Preprint(Preprint), 1–22
(2023)
13. Glauer, M., Neuhaus, F., Mossakowski, T., Memariani, A., Hastings, J., Hitzler, P.,
Sarker, M., Eberhart, A.: Neuro-symbolic semantic learning for chemistry. Com-
pendium of Neurosymbolic Artificial Intelligence. Frontiers in Artificial Intelligence
and Applications pp. 460–484 (2023)
14. Glauer, M., Neuhaus, F., Flügel, S., Wosny, M., Mossakowski, T., Memariani, A.,
Schwerdt, J., Hastings, J.: Chebifier: Automating semantic classification in ChEBI
to accelerate data-driven discovery. Digital Discovery p. to appear (2024)
15. Glauer, M., Neuhaus, F., Mossakowski, T., Hastings, J.: Ontology pre-training for
poison prediction. In: German Conference on Artificial Intelligence (Künstliche
Intelligenz). pp. 31–45. Springer (2023)
16. Glauer, M., West, R., Michie, S., Hastings, J.: Esc-rules: Explainable, semantically
constrained rule sets. arXiv preprint arXiv:2208.12523 (2022)
17. Hájek, P.: Metamathematics of fuzzy logic, vol. 4. Springer Science & Business
Media (2013)
18. Hastings, J., Glauer, M., Memariani, A., Neuhaus, F., Mossakowski, T.: Learning
chemistry: exploring the suitability of machine learning for the task of structure-
based chemical ontology classification. Journal of Cheminformatics 13, 1–20 (2021)
19. Hastings, J., Glauer, M., West, R., Thomas, J., Wright, A.J., Michie, S.: Predicting
outcomes of smoking cessation interventions in novel scenarios using ontology-
informed, interpretable machine learning. Wellcome Open Research 8(503), 503
(2023)
20. Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V.,
Turner, S., Swainston, N., Mendes, P., Steinbeck, C.: ChEBI in 2016: Improved ser-
vices and an expanding collection of metabolites. Nucleic Acids Research 44(D1),
D1214–D1219 (Jan 2016). https://doi.org/10.1093/nar/gkv1031
21. Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker,
B.A., Thiessen, P.A., Yu, B., et al.: Pubchem 2023 update. Nucleic acids research
51(D1), D1373–D1380 (2023)
22. Kruse, R., Nauck, D.: Neuro-fuzzy systems. In: Computational Intelligence: Soft
Computing and Fuzzy-Neuro Integration with Applications. pp. 230–259. Springer
(1998)
23. Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., De Raedt, L.: Deep-
problog: Neural probabilistic logic programming. Advances in neural information
processing systems 31 (2018)
24. Nations, U.: Globally harmonized system of classification and labelling of chemi-
cals, rev. 10. Tech. rep., United Nations (2023)
25. Riegel, R., Gray, A., Luus, F., Khan, N., Makondo, N., Akhalwaya, I.Y., Qian,
H., Fagin, R., Barahona, F., Sharma, U., et al.: Logical neural networks. arXiv
preprint arXiv:2006.13155 (2020)
26. Towell, G.G., Shavlik, J.W.: Knowledge-based artificial neural networks. Artificial
intelligence 70(1-2), 119–165 (1994)
27. Wehrmann, J., Cerri, R., Barros, R.: Hierarchical multi-label classification net-
works. In: International conference on machine learning. pp. 5075–5084. PMLR
(2018)
28. Weininger, D.: Smiles, a chemical language and information system. 1. introduction
to methodology and encoding rules. Journal of chemical information and computer
sciences 28(1), 31–36 (1988)
29. Xu, J., Zhang, Z., Friedman, T., Liang, Y., Broeck, G.: A semantic loss function for
deep learning with symbolic knowledge. In: International conference on machine
learning. pp. 5502–5511. PMLR (2018)
30. Zhang, D., Bai, X.L., Cai, K.Y.: Extended neuro-fuzzy models of multilayer per-
ceptrons. Fuzzy sets and systems 142(2), 221–242 (2004)
