A Semantic Loss For Ontology Classification
Abstract. Deep learning models are often unaware of the inherent con-
straints of the task they are applied to. However, many downstream tasks
require logical consistency. For ontology classification tasks, such con-
straints include subsumption and disjointness relations between classes.
In order to increase the consistency of deep learning models, we propose
a semantic loss that combines label-based loss with terms penalising
subsumption- or disjointness-violations. Our evaluation on the ChEBI
ontology shows that the semantic loss is able to decrease the number of
consistency violations by several orders of magnitude without decreasing
the classification performance. In addition, we use the semantic loss for
unsupervised learning. We show that this can further improve consistency
on data from a distribution outside the scope of the supervised training.
1 Introduction
Deep learning models have been successfully applied to a wide range of clas-
sification tasks over the past years, often replacing hand-crafted features with
end-to-end feature learning [7,3]. This approach is based on the assumption that
all the knowledge required to solve a specific classification task is available in
the data used. These systems are often built for a specific use case. In the case
of a classification problem, emphasis is placed on the correct classification of
the input data and the success of a system is measured in its ability to cor-
rectly perform this task. However, this approach disregards that there are often
domain-specific logical constraints between different classification targets.
These logical constraints can be of great importance, as applications, espe-
cially those leading to further development, are often based on the assumption
that inputs are logically consistent. Imagine a system consisting of two com-
ponents in an autonomous vehicle. The first component recognises and labels
objects using a deep learning model. Based on this output, a rule-based system
built by experts determines the direction of travel. A contradictory classification
of the first system, e.g. a traffic light as both red and green, or a road user as a
pedestrian and as a car, can have fatal consequences, as the control system may
not cover such a scenario.
It is therefore important to prime systems towards logical consistency. Learn-
ing concepts by example is not optimally suited to adhere to domain-specific
constraints out of the box. Instead, one relies on the corresponding constraints
being represented in the data and on the model approximating them during
training. However, this approach has significant
disadvantages. Firstly, it assumes that there is a sufficiently large amount of data
so that the corresponding constraints are well represented. Secondly, the system
is deprived of important information that is readily available in the domain.
Thirdly, it creates an additional, implicit learning task that is not adequately
represented by the loss function.
For many research domains, ontologies exist that define important concepts
and their relations via logical constraints [20,4,2]. Ontologies therefore provide
a necessary logical axiomatisation that can be used to check the consistency
of models and to prime them for consistency. For instance, the subsumption
relation A is-a B requires that every entity classified as A is also classified as
B. Usually, this knowledge is not explicitly given to machine learning models
trained on concepts from an ontology. Instead, it can only be derived implicitly
from seeing a large enough number of A samples that are also B samples.
The aim of this paper is to integrate symbolic knowledge from ontologies into
the learning process of a machine learning model. To this end, in Section 3, we
present a semantic loss that extends regular loss functions by additional terms
that ensure the model’s coherence with ontological constraints. In Section 4, we
introduce a classification task on the ChEBI ontology and appropriate evaluation
metrics. These are used in Section 5, where we evaluate different semantic loss
variants. The results are discussed in Section 6 and a conclusion is drawn in
Section 7.
2 Related Work
A well-studied field within Machine Learning is hierarchical multi-label classi-
fication, in which labels are structured in a hierarchy, similar to the subsump-
tion relations in an ontology. However, it is usually assumed that each class only
has one superclass, which allows the assignment of hierarchy levels. Many models
use these levels directly in their architecture [27,5]. In ontologies such as ChEBI,
many classes have multiple superclasses, which makes the assignment of hier-
archy levels non-trivial. In addition, ontologies include different kinds of logical
relations between classes, such as disjointness or parthood relations. Therefore,
our task requires a more general approach towards ensuring logical consistency.
Among the approaches that have integrated logical constraints into neural
networks, one of the earliest was “Knowledge-Based Artificial Neural Net-
works” (KBANN, [26]), which attempted to directly represent formulae in propo-
sitional logic within the network structure. During training, the system is able
to adapt these structures to better fit the training data. This allows the priming
of a learning system with prior knowledge.
The training process of neural networks is usually based on a form of gra-
dient descent. Consequently, instead of crisp answers with truth values {0, 1},
one must allow arbitrary predictions from the interval [0, 1] in order to remain differentiable.
This naturally leads to an interpretation of these values as values from a many-
valued logic such as fuzzy logic or probabilistic logic. Indeed, there have been
many approaches that aim to combine fuzzy systems and neural networks [22,30].
These systems are particularly useful when training data is limited. In recent
work [16,19], we applied an ontology-based neuro-fuzzy controller that uses a
semantic penalty system to ensure logically sound rules; the approach in this
paper is inspired by that work.
DeepProbLog [23] follows a probabilistic interpretation of prediction val-
ues. This approach is based on the probabilistic logic programming framework
ProbLog [9]. ProbLog allows the expression of Prolog-like inference rules with ad-
ditional uncertainty annotations, e.g. 0.3::P(X) :- Q(X), R(X). The formulation
of these rules does, however, require extensive expert knowledge or data in or-
der to derive the appropriate annotations. DeepProbLog extends this framework
by allowing uncertainty annotations to be derived from a neural network. Logic
Tensor Networks (LTNs, [1]) train neural predicates to maximise the satisfiability
of a background theory, which is a form of semantic loss.
Neural networks are, in particular during training, prone to making mis-
takes that may result in logically inconsistent predictions. An image recognition
system may, for example, classify the same picture as a cat and a dog. In com-
bination with logic approaches, these mistakes may cause severe side effects for
other systems that expect consistent input [11]. In most classical logics, once
an inconsistency has been derived, any statement is entailed. This strong effect
of inconsistencies is not desirable in applications that must allow for some level
of inconsistency - in particular if human input is used. If a person makes an
inconsistent statement in their tax form, a possible neuro-symbolic tax system
should not be able to infer that Elvis is the king of Sweden or other arbitrary
facts from that - the inconsistency should be kept local. Logical Neural Networks
(LNNs, [25]) allow for some local inconsistencies in their reasoning process. This
kind of network is designed to directly represent the structure of a logical theory
with upper and lower bounds instead of truth values. During inference, these
systems also use a semantic penalty term that trains the system to avoid logical
inconsistencies.
Xu et al. propose a more general definition of a semantic loss for arbitrary
logical sentences [29]. Neural network outputs in a multi-class classification task
are interpreted as probabilities, which induces a probability for every truth
assignment of the output variables; each assignment either satisfies a given
logical sentence or it does not. The loss function is then defined as the negative
logarithm of the total probability of all assignments satisfying the sentence φ
in question:

LXu(φ, p) ∝ − log Σ_{x |= φ} Π_{i : x |= Xi} pi · Π_{i : x |= ¬Xi} (1 − pi)   (1)

Models in this loss definition are binary. Therefore, a given model x satisfies A ⊑ B
if and only if x |= ¬A or x |= B. For an implication A ⊑ B and a prediction
vector p, the loss is calculated as

LXu(A ⊑ B, p) ∝ − log Σ_{x |= A⊑B} Π_{i : x |= Xi} pi · Π_{i : x |= ¬Xi} (1 − pi)
             = − log((1 − pa) · (1 − pb) + (1 − pa) · pb + pa · pb)   (2)
             = − log(pa · pb − pa + 1)
             = − log(1 − pa · (1 − pb))
The loss definition of Xu et al. is similar to the one used in this work, although we
derive our loss from fuzzy logic rather than probabilistic logic. When using the product
t-norm, our loss for an implication is defined as
Lprod (A ⊑ B, p) = pa · (1 − pb ) (3)
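To make the behaviour of these loss terms concrete, the following Python sketch (an
illustration of ours, not code taken from our implementation) computes the violation
degree of a single subsumption axiom A ⊑ B under the product t-norm, the Łukasiewicz
t-norm (assuming it is instantiated analogously to Eq. 3) and the probabilistic loss of
Xu et al., given predicted memberships pa and pb:

import math

def impl_loss_product(p_a: float, p_b: float) -> float:
    # Violation degree of A ⊑ B under the product t-norm (Eq. 3):
    # T_prod(p_a, 1 - p_b) = p_a * (1 - p_b)
    return p_a * (1.0 - p_b)

def impl_loss_lukasiewicz(p_a: float, p_b: float) -> float:
    # Łukasiewicz t-norm T_luka(a, b) = max(a + b - 1, 0) applied to
    # (p_a, 1 - p_b), which reduces to max(p_a - p_b, 0).
    return max(p_a - p_b, 0.0)

def impl_loss_xu(p_a: float, p_b: float) -> float:
    # Probabilistic semantic loss of Xu et al. (Eq. 2): negative log of the
    # probability mass of all assignments satisfying A ⊑ B.
    return -math.log(1.0 - p_a * (1.0 - p_b))

# Example: the model is fairly confident about A but hesitant about B.
p_a, p_b = 0.9, 0.4
print(impl_loss_product(p_a, p_b))      # ≈ 0.54
print(impl_loss_lukasiewicz(p_a, p_b))  # ≈ 0.5
print(impl_loss_xu(p_a, p_b))           # ≈ 0.78

All three losses vanish whenever the prediction respects the axiom (pa ≤ pb for the
Łukasiewicz variant, pb = 1 or pa = 0 for the others) and grow with the degree of
violation.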
3 Semantic loss
Generalising Eq. 3 to an arbitrary t-norm T, the semantic loss term for a sub-
sumption axiom A ⊑ B and a prediction vector p is defined as

LT(A ⊑ B, p) = T(pa, N(pb))   (4)

This assumes that the fuzzy negation used is a strong negation N(x) = 1 − x.
Accordingly, the semantic loss term for a disjointness axiom C ⊓ D ≡ ⊥ is defined as

LT(C ⊓ D ≡ ⊥, p) = T(pc, pd)   (5)

Intuitively, the semantic loss can be interpreted as the degree to which a given
prediction violates an ontological constraint.
In Section 5, we evaluate loss functions derived from two commonly used
t-norms, the product t-norm Tprod(a, b) = a · b and the Łukasiewicz t-norm
Tluka(a, b) = max(a + b − 1, 0).
Let x be a vector of m samples, y the vector of m corresponding label vectors
and ŷ = [hA(x)]A∈O the vector of predicted labels. Based on the loss terms given
in Eqs. 4 and 5, we define our loss function as a weighted combination of a
classification loss Lbase and the aggregated semantic loss terms:

L(x, y) = Lbase(x, y) + wimpl · Limpl(ŷ) + wdisj · Ldisj(ŷ)   (6)

where Limpl and Ldisj aggregate the implication and disjointness loss terms over
all samples and axiom pairs.
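The following PyTorch-style sketch illustrates how such a combined loss can be
assembled for a batch of predictions; the tensor layout of the axiom index pairs, the
product t-norm instantiation and the simple averaging over samples and axioms are
illustrative assumptions rather than the exact aggregation of our implementation:

import torch

def semantic_loss(preds, labels, impl_pairs, disj_pairs,
                  base_loss, w_impl=0.01, w_disj=100.0):
    # preds, labels: (batch, num_classes) tensors of predicted / true memberships.
    # impl_pairs: LongTensor of shape (n_impl, 2), index pairs (A, B) for A ⊑ B.
    # disj_pairs: LongTensor of shape (n_disj, 2), index pairs (C, D) for C ⊓ D ≡ ⊥.
    loss = base_loss(preds, labels)
    # Product t-norm violation degree of each subsumption axiom: p_A * (1 - p_B)
    impl_violation = preds[:, impl_pairs[:, 0]] * (1.0 - preds[:, impl_pairs[:, 1]])
    # Violation degree of each disjointness axiom: p_C * p_D
    disj_violation = preds[:, disj_pairs[:, 0]] * preds[:, disj_pairs[:, 1]]
    return loss + w_impl * impl_violation.mean() + w_disj * disj_violation.mean()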
4 Experimental setup
We evaluate the semantic loss for a classification task in the ChEBI ontology.
This task has been studied in our previous work and a deep learning-based ap-
proach for the ChEBI classification task has been developed [18,12,13]. In all
evaluations, we train an ELECTRA model [6] for a hierarchical multi-label clas-
sification task in which ChEBI classes act as labels and molecules as instances.
For a detailed description of the approach, we refer to [14]. Here, we just provide
an overview. The source code for our implementation is available on GitHub 1 .
1 https://github.com/ChEB-AI/python-chebai
4.1 Datasets
Our setup draws data from two sources. Labelled data is taken from the ChEBI
ontology [10,20], while additional unlabelled data is sourced from the PubChem
database [21]. All datasets are available on Kaggle 2 .
In all datasets, we use SMILES (Simplified Molecular Input Line Entry System)
[28], a common string representation for chemical structures. It encodes molecules
as sequences in which characters represent atoms and bonds, with additional
notation for branches, rings and stereoisomerism (e.g., CC(=O)O denotes acetic acid).
For the labelled data, we use version 231 of ChEBI, which contains 185 thou-
sand SMILES-annotated classes. From these classes, we form the ChEBI100
dataset by attaching as labels all superclasses that have at least 100 SMILES-
annotated subclasses. The transitive closure of subsumption relations between
the labels is used for the semantic loss. Disjointness axioms for ChEBI are pro-
vided by an additional ontology module 3 . Here as well, we take the transitive
closure of all disjointness relations between label-classes. In total, this provides
us with 997 labels, 19,308 implication loss terms and 31,416 disjointness loss
terms.
From PubChem, we have sourced two distinct datasets. The first is used
during training while the second one, PubChem Hazardous, is only used in the
evaluation. The Hazardous dataset includes SMILES strings for chemicals that
are annotated with a class from the Globally Harmonized System of Classifi-
cation and Labelling of Chemicals (GHS) [24]. The GHS covers different kinds
of health, physical and environmental hazards and has been developed by the
United Nations as a standard for labelling hazardous chemicals and providing
related safety instructions. From this, we have removed all SMILES strings that
also appear in the labelled dataset. In our evaluation, we use this dataset to test
model performance for a data distribution outside the learning distribution.
For the training dataset, we have randomly selected 1 million SMILES strings
from PubChem. This set has been split into groups of 10,000. From each group,
we have selected the 2,000 SMILES strings with the lowest similarity score,
resulting in 200,000 instances. The similarity score used is the sum of pairwise
Tanimoto similarities between the RDKit fingerprints of a given SMILES string
and all other SMILES strings in the group. This way, we ensure with limited
computational expense that our dataset covers a broad range of chemicals.
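The following sketch illustrates this diversity-based subsampling with RDKit; the
function name and the handling of unparsable SMILES strings are illustrative choices,
not the exact code of our pipeline:

from rdkit import Chem, DataStructs

def select_diverse(smiles_group, keep=2000):
    # Fingerprint every parsable molecule in the group.
    fps, kept_smiles = [], []
    for s in smiles_group:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:
            fps.append(Chem.RDKFingerprint(mol))
            kept_smiles.append(s)
    # Similarity score of a molecule: sum of its pairwise Tanimoto similarities
    # to all molecules in the group (the constant self-similarity of 1 does not
    # affect the ranking).
    scores = [sum(DataStructs.BulkTanimotoSimilarity(fp, fps)) for fp in fps]
    # Keep the molecules with the lowest total similarity, i.e. the most diverse ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return [kept_smiles[i] for i in ranked[:keep]]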
The training set is used for two tasks: firstly, a pretraining step prior to the
training on the ChEBI100 dataset, which is shared by all models in our evaluation;
and secondly, a semi-supervised training in which we train a model simultaneously
on labelled and unlabelled data. In our evaluation, we will compare semi-supervised
training to training on only labelled data.
In order to apply the semantic loss function from Eq. 6, we need to choose a
classification loss Lbase and assign the weights wimpl and wdisj .
2 https://www.kaggle.com/datasets/sfluegel/chebai-semantic-loss
3 https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi-disjoints.owl
For the classification loss, we have chosen a weighted binary cross-entropy
loss:

Lbase(x, y) = −(1/m) Σ_{i=1}^{m} Σ_{c∈C} wc · (yi,c · log hc(xi) + (1 − yi,c) · log(1 − hc(xi)))   (8)

wc = (w′c · |C|) / (Σ_{c′∈C} w′c′)   where   w′c = (1 − β) / (1 − β^|c|)   (9)

Here, |c| denotes the number of positive samples of class c in the training set
and β is a hyperparameter; the weighting follows the class-balanced loss of [8].
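As an illustration, the class weights of Eq. 9 and the weighted cross-entropy of Eq. 8
can be computed as in the following sketch; the name n_positive (the per-class counts
|c|) and the mean reduction are our reading of the equations, not verbatim
implementation code:

import torch
import torch.nn.functional as F

def class_weights(n_positive, beta=0.99):
    # Eq. 9: w'_c = (1 - beta) / (1 - beta^{|c|}), rescaled so that the
    # weights sum to the number of classes |C|.
    w = (1.0 - beta) / (1.0 - torch.pow(torch.tensor(beta), n_positive.float()))
    return w * w.numel() / w.sum()

def weighted_bce(preds, labels, weights):
    # Eq. 8: weighted binary cross-entropy, averaged over samples and classes.
    # preds are predicted memberships h_c(x) in (0, 1).
    return F.binary_cross_entropy(preds, labels.float(),
                                  weight=weights.expand_as(preds))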
For the semantic loss terms, although the number of disjointness terms (31,416)
is larger than the number of implication terms (19,308), we have chosen the
weights wimpl = 0.01 and wdisj = 100. This is motivated by preliminary ex-
periments in which the implication loss was larger than the disjointness loss by
several orders of magnitude.
To measure the consistency of a model's predictions with respect to the ontology,
we count, for each sample x, the axioms that are respected or violated by an active
prediction and report the false negative rate FNR = #FN / (#FN + #TP):

#TPimpl(x) = |{(A, B) : A ⊑ B ∧ hA(x) > 0.5 ∧ hB(x) > 0.5}|   (10)
#FNimpl(x) = |{(A, B) : A ⊑ B ∧ hA(x) > 0.5 ∧ hB(x) ≤ 0.5}|   (11)
#TPdisj(x) = |{(C, D) : C ⊓ D ≡ ⊥ ∧ hC(x) > 0.5 ∧ hD(x) ≤ 0.5}|   (12)
#FNdisj(x) = |{(C, D) : C ⊓ D ≡ ⊥ ∧ hC(x) > 0.5 ∧ hD(x) > 0.5}|.   (13)
This definition does not take the cases hA(x) ≤ 0.5 or hC(x) ≤ 0.5 into account,
which could be considered true positives as well, since they do not contradict
the ontology axioms. However, since these cases do not require an active prediction,
we consider them as "consistent by default" and as less relevant for our evaluation.
Note that, although disjointness axioms are symmetric, this non-symmetric metric
requires that we consider both "directions" of the axiom.
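The following sketch illustrates how these counts and the resulting false negative
rates can be computed from thresholded predictions; the tensor layout of the axiom
index pairs is an illustrative assumption:

import torch

def consistency_fnr(preds, impl_pairs, disj_pairs, threshold=0.5):
    # preds: (batch, num_classes); *_pairs: LongTensors of shape (n, 2) with label indices.
    pos = preds > threshold
    a, b = pos[:, impl_pairs[:, 0]], pos[:, impl_pairs[:, 1]]
    c, d = pos[:, disj_pairs[:, 0]], pos[:, disj_pairs[:, 1]]
    tp_impl = (a & b).sum()     # A and B predicted: A ⊑ B respected (Eq. 10)
    fn_impl = (a & ~b).sum()    # A predicted without B: violation (Eq. 11)
    tp_disj = (c & ~d).sum()    # C predicted without D: disjointness respected (Eq. 12)
    fn_disj = (c & d).sum()     # C and D both predicted: violation (Eq. 13)
    fnr_impl = fn_impl / (fn_impl + tp_impl).clamp(min=1)
    fnr_disj = fn_disj / (fn_disj + tp_disj).clamp(min=1)
    return fnr_impl, fnr_disj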
Vocabulary size 1,400
Hidden size 256
# attention heads 8
# hidden layers 6
# max. epochs 200
learning rate 1e−3
Optimizer Adamax
wimpl 0.01
wdisj 100
β 0.99
Table 1: Hyperparameters used during training
5 Results
We evaluate four configurations of the semantic loss, one using the Łukasiewicz
t-norm Tluka(a, b) = max(a + b − 1, 0) and three using the product t-norm
Tprod(a, b) = a · b. For the product t-norm, we include, besides the "standard"
variant, one which uses the balanced implication loss LB described in Section 3.1
with k = 2 and the semi-supervised variant trained on a mixed ChEBI100 and
PubChem dataset (see Section 4.1). For comparison, we also include a con-
figuration using the semantic loss described by Xu et al. [29] and a baseline
configuration trained without semantic loss.
We have conducted a pretraining run on our PubChem dataset and sub-
sequently, 3 fine-tuning runs for each variant. In the following, we will only
report averages over the 3 runs; the results for individual runs can be found in
Appendix A.
The hyperparameters shared by all models are given in Table 1. We have split
the ChEBI100 dataset into a training, validation and test set with a 340/9/51
ratio. The evaluation has been conducted on the test set using the models with
the highest micro-F1 score from each training run.
Table 2 and Figure 1a show the false negative rate (FNR) for implication and
disjointness violations on the ChEBI100 and the PubChem Hazardous datasets.
It can be seen that all models outperform the baseline by a margin of about
two orders of magnitude. In absolute terms, this corresponds to 13 × 103 false
ChEBI100 PubChem Hazardous
baseline 0.0031 ± 0.0002 0.0067 ± 0.005
TLuka 2.09 × 10−5 ± 3.3 × 10−6 1.49 × 10−5 ± 2.6 × 10−5
Tprod 3.18 × 10−5 ± 1.2 × 10−5 5.54 × 10−5 ± 4.6 × 10−5
Tprod (k=2) 3.74 × 10−5 ± 1.5 × 10−5 7.29 × 10−5 ± 8.8 × 10−5
Tprod (mixed data) 5.91 × 10−5 ± 3.2 × 10−5 1.05 × 10−5 ± 9.8 × 10−6
Xu et al. 3.62 × 10−5 ± 1.2 × 10−5 3.93 × 10−5 ± 2.8 × 10−5
Table 2: Average FNR for binary implication violations on the ChEBI100 and
PubChem Hazardous datasets. The FNR has been calculated separately for each
run before averaging. In addition, the table shows the standard deviation be-
tween the runs.
Fig. 1: (a) FNR for implication violations on the ChEBI100 (left) and PubChem
Hazardous datasets. The standard deviation is indicated by black dashes along
each bar. (b) F1-scores on the ChEBI100 dataset, aggregated on the micro- (left)
and macro-level. The black bars indicate the standard deviation between the 3 runs.
negatives for the baseline and between 81 FNs (for TLuka ) and 247 FNs (Tprod
with mixed data) for the semantic loss models. The number of true positives is
similar for all models (between 3.85 × 106 and 4.29 × 106 ).
For the Łukasiewicz t-norm models, we observe the lowest FNR on ChEBI100 .
The models trained with a product t-norm based loss and the semantic loss of
Xu et al. have slightly higher FNRs on ChEBI100 . Regarding the PubChem
Hazardous dataset, it is remarkable that, while most models have a similar or
slightly higher FNR compared to ChEBI100 , the models trained with additional
PubChem data perform significantly better on PubChem Hazardous.
Regarding the disjointness violations, we have not observed any violations
for any model except the baseline models. There, we have averages of 171 FNs
on ChEBI100 and 4 FNs on PubChem Hazardous. While these numbers are far
below the numbers of TPs (1.07 × 107 for ChEBI100 and 1.12 × 108 for PubChem
Hazardous), they show that the semantic loss had a consistency-improving effect.
For all semantic loss variants, the models were able to produce inconsistency-free
results regarding disjointness.
In addition, we also evaluated the predictive performance of all configura-
tions. As can be seen in Table 3 and Figure 1b, the F1-score for the models
Micro-F1 Macro-F1
baseline 0.913 ± 0.004 0.653 ± 0.02
TLuka 0.869 ± 0.003 0.395 ± 0.02
Tprod 0.907 ± 0.0003 0.593 ± 0.009
Tprod (k=2) 0.913 ± 0.004 0.643 ± 0.04
Tprod (mixed data) 0.898 ± 0.002 0.501 ± 0.03
Xu et al. 0.906 ± 0.002 0.579 ± 0.02
Table 3: F1-scores calculated on labelled data. The micro-F1 aggregates pre-
dictions over all classes before calculating the score, while the macro-F1 is the
average of the class-wise scores.
that were trained with semantic loss, with the exception of the balanced version,
is slightly lower than for the baseline models. This is particularly true for the
models that were trained on mixed data or with the Łukasiewicz loss.
The lower performance of the Łukasiewicz models is linked to an unsuccess-
ful training. While the performance of all other models continuously increased
during training and converged to the level reported here near the end of the allotted
200 epochs, for all 3 Łukasiewicz models, the performance started to drop at ap-
proximately 50 epochs into the training. Further analysis suggests that the drop
during training has been caused by exploding gradients. Here, we report the re-
sults for the best-performing models near the 50-epoch mark. At that point, the
performance of the other semantic loss models was similar to the Łukasiewicz
models.
6 Discussion
Our results indicate that the introduction of a semantic loss during training in-
creases the overall logical consistency of predictions significantly. However, since
the number of consistency violations is relatively low even for the baseline model,
one might consider an a posteriori processing step that transforms the model out-
put into consistent predictions (e.g., by setting all but the highest output value
to 0 for disjointness axioms). This can be expected to lead to little or no loss in pre-
dictive performance since most predictions are already non-violating and some
corrections might even turn wrong predictions into correct ones. However, we
have only considered the unprocessed predictions in our evaluation. We justify
this by the intended use cases: A model trained on the classification task may
further be used in downstream tasks (e.g., prediction of chemical properties [15]).
Those require that the model has actually learned the ontology’s structure. This
can only be achieved by giving the model direct feedback during training (as we
did with the semantic loss) instead of superficially correcting the results.
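For illustration, such an a posteriori repair step could look as follows; this sketch only
makes the idea mentioned above concrete and is not part of our evaluated pipeline:

import torch

def enforce_consistency(preds, impl_pairs, disj_pairs, threshold=0.5):
    # Simple a posteriori repair: propagate positive scores upwards along the
    # (transitively closed) subsumption pairs and, where a disjointness axiom is
    # violated, set the smaller of the two conflicting outputs to 0.
    fixed = preds.clone()
    for a, b in impl_pairs:                      # A ⊑ B
        fixed[:, b] = torch.maximum(fixed[:, b], fixed[:, a])
    for c, d in disj_pairs:                      # C ⊓ D ≡ ⊥
        both = (fixed[:, c] > threshold) & (fixed[:, d] > threshold)
        drop_c = both & (fixed[:, c] < fixed[:, d])
        drop_d = both & ~drop_c
        fixed[drop_c, c] = 0.0
        fixed[drop_d, d] = 0.0
    return fixed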
For most semantic loss variants, the increase in consistency comes at the
expense of the actual predictive quality. This result seems contradictory at
first, as one would assume more consistent results to be better overall.
One possible explanation lies in the imbalanced character of ontology-based
datasets. The hierarchical relations between labels create a dataset in which a
few general classes account for most of the positive labels, while the more specific
classes have only few positive samples.
Fig. 2: Value of the semantic loss variants Lprod, Lluka, LXu and LB_prod with
k = 2 for a subsumption relation A ⊑ B with different values of hA(x) and
hB(x). LXu has been cut off at LXu = 1 since LXu(A ⊑ B, p) → ∞ as pa → 1
and pb → 0.
7 Conclusion
In this work, we have introduced a semantic loss function for the task of on-
tology classification. Our semantic loss is based on a fuzzy logic interpretation
of the ontology subsumption and disjointness relations. To counteract the loss
function’s tendency to disincentivise predictions of low-level ontology classes,
we have proposed a balanced semantic loss variant as well. In our evaluation,
we have compared different versions of our semantic loss (based on either the
Łukasiewicz t-norm or the product t-norm) to a baseline model and the semantic
loss function proposed by [29]. We have shown that all semantic loss variants
were able to reduce the number of consistency violations by approximately two
orders of magnitude.
Regarding performance on the classification task, we have seen greater differ-
ences between the loss functions. Most variants have both a slightly lower micro-
and a significantly lower macro-F1 than the baseline (especially the Łukasiewicz-based
variant). This indicates that especially the predictive performance of small classes
is affected by the semantic loss. Only the balanced semantic loss was able to per-
form on a par with our baseline.
In addition, we have used the semantic loss for an additional training task
on unlabelled data. This allows us to generalise beyond the original data distri-
bution used for supervised training. Our evaluation on the Hazardous subset of
PubChem shows that this form of training can further improve the consistency
of predictions for out-of-distribution data.
Future work will include an improved normalisation to avoid performance
issues such as those we reported for the Łukasiewicz semantic loss. Also, it is possible
to extend our approach to other types of ontology axioms, e.g., parthood rela-
tions. Finally, it would be interesting to incorporate our balanced implication
loss into more general frameworks such as LTNs or LNNs.
Acknowledgements
This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, Ger-
man Research Foundation) - 522907718.
A Result for individual runs
In Section 5, we have presented results as the average and standard deviation
out of 3 runs for every configuration.
Tables 4 and 5 show the FNR and F1-scores for the individual runs. For
the FNR, it can be observed that the results vary significantly between runs,
especially on the Hazardous dataset. This is likely due to the small scale we
are considering: on ChEBI100, an FNR of 2.5 × 10−5 corresponds to roughly
100 observed false negatives over the whole test set. That is, out of 19
thousand samples, each of which has 19 thousand subsumption relations that
could have resulted in a false negative, only 100 actually are false negatives.
Therefore, slight changes in the predictive performance can have a significant
impact on the false negative rate.
The F1-scores are more stable overall, with a range of less than one percent
for the micro aggregation and up to six percent for the macro aggregation.
References
1. Badreddine, S., d’Avila Garcez, A.S., Serafini, L., Spranger, M.: Logic tensor net-
works. Artif. Intell. 303, 103649 (2022). https://doi.org/10.1016/j.artint.2021.103649
2. Bayerlein, B., Schilling, M., Birkholz, H., Jung, M., Waitelonis, J., Mädler, L., Sack,
H.: Pmd core ontology: Achieving semantic interoperability in materials science.
Materials & Design 237, 112603 (2024)
3. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P.,
Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for
self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
4. Booshehri, M., Emele, L., Flügel, S., Förster, H., Frey, J., Frey, U., Glauer, M.,
Hastings, J., Hofmann, C., Hoyer-Klick, C., et al.: Introducing the open energy
ontology: Enhancing data interpretation and interfacing in energy systems analysis.
Energy and AI 5, 100074 (2021)
5. Cerri, R., Barros, R.C., De Carvalho, A.C.: Hierarchical multi-label classification
using local neural networks. Journal of Computer and System Sciences 80(1), 39–
56 (2014)
6. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: Electra: Pre-training text en-
coders as discriminators rather than generators. arXiv preprint arXiv:2003.10555
(2020)
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. Journal of machine learning
research 12, 2493–2537 (2011)
8. Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on
effective number of samples. In: Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition. pp. 9268–9277 (2019)
9. De Raedt, L., Kimmig, A., Toivonen, H.: Problog: A probabilistic prolog and its
application in link discovery. In: IJCAI 2007, Proceedings of the 20th international
joint conference on artificial intelligence. pp. 2462–2467. IJCAI-INT JOINT CONF
ARTIF INTELL (2007)
10. Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught,
A., Alcántara, R., Darsow, M., Guedj, M., Ashburner, M.: ChEBI: a database
and ontology for chemical entities of biological interest. Nucleic Acids Research
36(Database issue), D344–D350 (Jan 2008). https://doi.org/10.1093/nar/gkm791
11. Giunchiglia, E., Stoian, M.C., Khan, S., Cuzzolin, F., Lukasiewicz, T.: Road-r: The
autonomous driving dataset with logical requirements. Machine Learning 112(9),
3261–3291 (2023)
12. Glauer, M., Memariani, A., Neuhaus, F., Mossakowski, T., Hastings, J.: Inter-
pretable ontology extension in chemistry. Semantic Web Preprint(Preprint), 1–22
(2023)
13. Glauer, M., Neuhaus, F., Mossakowski, T., Memariani, A., Hastings, J., Hitzler, P.,
Sarker, M., Eberhart, A.: Neuro-symbolic semantic learning for chemistry. Com-
pendium of Neurosymbolic Artificial Intelligence. Frontiers in Artificial Intelligence
and Applications pp. 460–484 (2023)
14. Glauer, M., Neuhaus, F., Flügel, S., Wosny, M., Mossakowski, T., Memariani, A.,
Schwerdt, J., Hastings, J.: Chebifier: Automating semantic classification in ChEBI
to accelerate data-driven discovery. Digital Discovery p. to appear (2024)
15. Glauer, M., Neuhaus, F., Mossakowski, T., Hastings, J.: Ontology pre-training for
poison prediction. In: German Conference on Artificial Intelligence (Künstliche
Intelligenz). pp. 31–45. Springer (2023)
16. Glauer, M., West, R., Michie, S., Hastings, J.: Esc-rules: Explainable, semantically
constrained rule sets. arXiv preprint arXiv:2208.12523 (2022)
17. Hájek, P.: Metamathematics of fuzzy logic, vol. 4. Springer Science & Business
Media (2013)
18. Hastings, J., Glauer, M., Memariani, A., Neuhaus, F., Mossakowski, T.: Learning
chemistry: exploring the suitability of machine learning for the task of structure-
based chemical ontology classification. Journal of Cheminformatics 13, 1–20 (2021)
19. Hastings, J., Glauer, M., West, R., Thomas, J., Wright, A.J., Michie, S.: Predicting
outcomes of smoking cessation interventions in novel scenarios using ontology-
informed, interpretable machine learning. Wellcome Open Research 8(503), 503
(2023)
20. Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V.,
Turner, S., Swainston, N., Mendes, P., Steinbeck, C.: ChEBI in 2016: Improved ser-
vices and an expanding collection of metabolites. Nucleic Acids Research 44(D1),
D1214–D1219 (Jan 2016). https://doi.org/10.1093/nar/gkv1031
21. Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker,
B.A., Thiessen, P.A., Yu, B., et al.: Pubchem 2023 update. Nucleic acids research
51(D1), D1373–D1380 (2023)
22. Kruse, R., Nauck, D.: Neuro-fuzzy systems. In: Computational Intelligence: Soft
Computing and Fuzzy-Neuro Integration with Applications. pp. 230–259. Springer
(1998)
23. Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., De Raedt, L.: Deep-
problog: Neural probabilistic logic programming. Advances in neural information
processing systems 31 (2018)
24. Nations, U.: Globally harmonized system of classification and labelling of chemi-
cals, rev. 10. Tech. rep., United Nations (2023)
25. Riegel, R., Gray, A., Luus, F., Khan, N., Makondo, N., Akhalwaya, I.Y., Qian,
H., Fagin, R., Barahona, F., Sharma, U., et al.: Logical neural networks. arXiv
preprint arXiv:2006.13155 (2020)
26. Towell, G.G., Shavlik, J.W.: Knowledge-based artificial neural networks. Artificial
intelligence 70(1-2), 119–165 (1994)
27. Wehrmann, J., Cerri, R., Barros, R.: Hierarchical multi-label classification net-
works. In: International conference on machine learning. pp. 5075–5084. PMLR
(2018)
28. Weininger, D.: Smiles, a chemical language and information system. 1. introduction
to methodology and encoding rules. Journal of chemical information and computer
sciences 28(1), 31–36 (1988)
29. Xu, J., Zhang, Z., Friedman, T., Liang, Y., Van den Broeck, G.: A semantic loss function for
deep learning with symbolic knowledge. In: International conference on machine
learning. pp. 5502–5511. PMLR (2018)
30. Zhang, D., Bai, X.L., Cai, K.Y.: Extended neuro-fuzzy models of multilayer per-
ceptrons. Fuzzy sets and systems 142(2), 221–242 (2004)