Schulz 2001


International Journal of Medical Informatics 64 (2001) 207–221

www.elsevier.com/locate/ijmedinf

Medical knowledge reengineering – converting major portions of the UMLS into a terminological knowledge base
Stefan Schulz a,*, Udo Hahn b
a Freiburg University Hospital, Department of Medical Informatics, Stefan-Meier-Straße 26, D-79104 Freiburg, Germany
b Freiburg University, Computational Linguistics Lab, Werthmannplatz 1, D-79085 Freiburg, Germany

Abstract

We describe a semi-automatic knowledge engineering approach for converting the human anatomy and pathology
portion of the UMLS metathesaurus into a terminological knowledge base. Particular attention is paid to the proper
representation of part-whole hierarchies, which complement taxonomic ones as a major hierarchy-forming principle
for anatomical knowledge. Our approach consists of four steps. First, concept definitions are automatically generated
from the metathesaurus, with LOOM as the target language. Second, integrity checking of the emerging taxonomic and
partonomic hierarchies is automatically carried out by the terminological classifier. Third, terminological cycles and
inconsistencies are manually eliminated and, in the last step, the knowledge base built this way is incrementally refined
by a medical expert. Our experiments were run on a terminological knowledge base composed of 164 000
concepts and 76 000 relations. We give empirical evidence for the lack of logical consistency and adequacy,
and for the improper granularity, of the UMLS knowledge source, and finally assess what kind of effort is needed to
render the formal target representation structures complete and empirically adequate. © 2001 Elsevier Science Ireland
Ltd. All rights reserved.

Keywords: UMLS; Description logics; Anatomy; Pathology

1. Introduction

The health care domain and the biomedical sciences are somewhat unique compared with other scientific areas, since large portions of their terminological knowledge are already structured in terms of controlled terminologies, classification systems and thesauri. According to the different tasks they have been designed for, such as statistics, clinical communication, accounting or document indexing, they exhibit considerable variability both in terms of coverage and granularity. Also, the way knowledge is organized differs between heterogeneous types of medical terminologies [1,2]. Classifications aim at providing exhaustive sets of mutually exclusive categories (or classes) such as the International Classification of Diseases (ICD) [3]. More complex systems such as nomenclatures (e.g.

* Corresponding author. Tel.: +49-761-203-6702; fax: +49-761-203-6711.
E-mail address: [email protected] (S. Schulz).

1386-5056/01/$ - see front matter © 2001 Elsevier Science Ireland Ltd. All rights reserved.
PII: S 1 3 8 6 - 5 0 5 6 ( 0 1 ) 0 0 2 0 1 - 5
208 S. Schulz, U. Hahn / International Journal of Medical Informatics 64 (2001) 207–221

SNOMED [4] or NHS clinical terms [5]) and thesauri (e.g. MeSH [6]) provide additional descriptive flexibility by way of compositionality of concepts, polyhierarchies and semantic links, though often at the price of increasing ambiguity and semantic vagueness. Although various kinds of medical terminologies are well adapted to different needs, the demand for homogeneous multi-purpose terminology servers has been increasingly expressed [7–11].

The 'Unified Medical Language System' (UMLS) [12] can be considered a direct response to this request. It contains about 800 000 concepts from more than 60 different classifications, nomenclatures and thesauri, all of which have been merged into the UMLS Metathesaurus. Additional semantic structure can be imposed on concepts by using 134 semantic types, provided by the UMLS Semantic Network, together with 54 semantic relations.1 Given its size, evolutionary diversity and inherent heterogeneity, it is no surprise that the lack of a solid formal foundation leads to numerous inconsistencies, circular definitions, etc. [13,14]. This may not cause severe problems when humans are in the loop and its use is limited to tasks such as those mentioned above. However, if the UMLS is to be used for more knowledge-intensive applications, such as natural language understanding of medical narratives [15] or medical decision support systems [16], those shortcomings might lead to an impasse.

As a consequence, formal models for dealing with medical knowledge have been proposed, using representation mechanisms based on conceptual graphs, semantic networks or description logics [17–19]. Not surprisingly, there is also a price to be paid for more expressiveness and formal rigor in terms of increasing modeling efforts and, hence, increasing maintenance costs. Therefore, concrete medical knowledge bases making full use of this rigid approach, especially those which employ high-end, KL-ONE-style knowledge representation languages (for a survey, cf. [20]), are usually restricted to rather small subdomains. Those systems developed within the framework of the above-mentioned formal approaches have all been designed from scratch, without making systematic use of the large body of knowledge contained in informal medical terminologies.

An intriguing approach would be to combine the massive coverage offered by informal medical terminologies with the high level of expressiveness supported by formally solid knowledge representation systems in order to develop sophisticated medical knowledge bases on a larger scale. This idea has already been fostered by Pisanelli et al. [10], who extracted knowledge from the UMLS semantic network as well as from parts of the metathesaurus and merged it with generic ontologies from other sources. In a similar way, Spackman and Campbell [21] describe how SNOMED [4] can be transformed from a multi-axial coding system into a formally founded ontology. Unfortunately, efforts up to now are entirely focused on taxonomic reasoning along generalization hierarchies (expressed by is-a relations) and lack a reasonable coverage of part-whole (i.e. part-of or has-part) relationships, a second major conceptual construct needed for reasoning, in the anatomy domain in particular.

This article is organized as follows. In Section 2, we argue for the relevance of part-whole reasoning for the medical domain and introduce a representation model which is rooted in a description logics framework [20]. In particular, we propose a tripartite data structure for encoding anatomical concepts in

1 UMLS is accessible via http://umlsinfo.nlm.nih.gov/.

Fig. 1. SEP triplets: partitive relations within taxonomies.

order to emulate partonomic reasoning by taxonomic reasoning. Section 3 contains an in-depth description of a four-step knowledge engineering procedure for semi-automatically converting UMLS specifications into a terminological knowledge base. Throughout this procedure our emphasis is on maintaining the consistency of the emerging knowledge base. We conclude in Section 4 by discussing some implications of our approach and prospects of future work.

2. Part-whole reasoning

As far as medical knowledge is concerned, two main hierarchy-building relationships can be identified, namely taxonomic (is-a) and partonomic (part-whole) ones. Unlike taxonomic reasoning in concept hierarchies, no fully conclusive mechanism exists up to now for reasoning along partonomic hierarchies in description logic systems. As anatomical knowledge, a crucial portion of medical knowledge, is principally organized along part-whole hierarchies, any proper medical knowledge representation has to take account of both hierarchy types [22].

The outstanding importance of part-whole hierarchies for anatomy and, consequently, for clinical medicine has recently motivated the development of semantic networks of anatomical concepts [23,24]. Although they provide ontologically precise descriptions of partonomies, their granularity level is usually rather high. Also, these terminological resources do not provide a formally founded methodology for part-whole reasoning that underlies various object-centered representation approaches as discussed by Artale et al. [25]. In one of these branches, the description logics community, several language extensions for knowledge representation systems have been proposed which provide special constructors for part-whole reasoning [19,26].

Motivated by proposals from Schmolze and Mark [27], as well as by design principles underlying the Read Codes Version 3 [28], we advocate an alternative solution for part-whole reasoning, one that does not exceed the expressiveness of the well-understood, parsimonious concept language ALC [29]. Unlike the constructor-based approaches mentioned before, our approach can easily cope with many of the exceptions to the transitivity of the part-of relation, which one encounters not only in medicine [30,31] but also in commonsense domains [32,33].

Instead of defining new operators with a built-in transitivity property, our proposal is centered around a particular data structure, so-called SEP triplets, especially designed for

empirically adequate part-whole reasoning (cf. the structural description in Fig. 1). They define a characteristic pattern of is-a hierarchies, which supports the emulation of inferences typical of transitive part-of relations, as well as exceptions to them. In this formalism, the relation anatomical-part-of describes the partitive relation between physical parts of an organism.

Each basic anatomical concept node is expanded to an SEP triplet. Such a triplet consists, first of all, of the anatomical concept itself, the so-called E-node (entity node). As an example, in Fig. 1, HE stands for the concept of the entire Hand. The second node of the triplet construct, the P-node (part node), is defined as the common subsumer of all concepts which have the role anatomical-part-of filled by the corresponding E-node. P-nodes can therefore be considered as a kind of reification of the relation anatomical-part-of. In Fig. 1, the P-node HP subsumes every concept which has HE (Hand) as a filler of the role anatomical-part-of, e.g. FE (Finger). Finally, both the P-node and the E-node have a common direct subsumer, the so-called S-node (structure node), HS (Hand-Structure) in Fig. 1. By definition, E-nodes and P-nodes are mutually disjoint, thus restricting anatomical-part-of to proper parthood, i.e. no anatomical concept can be anatomical-part-of itself (e.g. no object in the world can be considered a Hand and a Part of a Hand simultaneously). This constraint might be relaxed under certain circumstances [34]. The SEP triplet construct can then be used to emulate transitive part-of hierarchies by linking S-nodes to P-nodes (cf. the is-a link between CS and DP in Fig. 1), and to exclude transitivity by linking nontransitive properties to the corresponding E-node of an SEP triplet [30,31]. The solution we propose is computationally neutral insofar as we extend the number of concept nodes by a constant factor (viz. two additional nodes per concept at most).

3. Semi-automatic transformation of an informal knowledge repository into a formal terminological knowledge base

Our goal is to extract conceptual knowledge from two highly relevant subdomains of the UMLS, anatomy and pathology, and to map it (semi-)automatically into a formally sound medical knowledge base. We use LOOM [35,36], a KL-ONE-style terminological knowledge representation language, as our implementation platform (for alternatives, cf. [37]), though our approach in no way depends on particular features of that language.2 The knowledge transformation task is divided into four steps: (1) the automatic generation of terminological assertions, (2) their submission to a terminological classifier3 for consistency checking, (3) the manual restitution of formal consistency in case of inconsistencies, and, finally, (4) the manual rectification and refinement of the resulting knowledge base. These four steps are illustrated by the workflow diagram depicted in Fig. 2.

3.1. Step 1: automatic generation of terminological assertions

Sources for concepts and relations were the UMLS semantic network and the mrrel, mrcon and mrsty tables of the 1999 release of

2 Cf. also the work of Carenini and Moore [38] who have already suggested a graphical interactive tool for mapping UMLS concepts semi-automatically into a LOOM knowledge base environment.

3 The description classifier of a terminological knowledge representation system [36] is the inference engine that computes subsumption relations between concepts, i.e. the generalization hierarchies that can be derived from is-a relations.

Fig. 2. Workflow diagram for the construction of a terminological knowledge base from the UMLS.
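Step 2 of the workflow checks the generated hierarchy for terminological cycles, i.e. pairs of concepts that subsume each other, and Step 3 removes them by means of a manually derived negative list. The following Python sketch shows one way such a check could look. It is not the LOOM classifier: the function names, the parent concepts Vascular-Disease and Vision-Disorder, and the assumption that pairs on the negative list are simply no longer linked by is-a are all illustrative.

```python
def reachable(edges, start):
    """All concepts reachable from start via one or more is-a links."""
    seen, stack = set(), list(edges.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def find_cycles(edges):
    """Return sorted pairs (a, b) where a subsumes b and b subsumes a."""
    nodes = set(edges) | {p for parents in edges.values() for p in parents}
    reach = {node: reachable(edges, node) for node in nodes}
    return sorted(
        (a, b) for a in nodes for b in reach[a] if a < b and a in reach[b]
    )


def without_pairs(edges, negative_list):
    """Drop is-a links between concept pairs on the negative list (assumed policy)."""
    banned = {frozenset(pair) for pair in negative_list}
    return {
        child: {p for p in parents if frozenset((child, p)) not in banned}
        for child, parents in edges.items()
    }


# child -> direct parents, as obtained from weak thesaurus-style relations
IS_A = {
    "Arteriosclerosis": {"Atherosclerosis", "Vascular-Disease"},
    "Atherosclerosis": {"Arteriosclerosis"},
    "Amaurosis": {"Blindness"},
    "Blindness": {"Amaurosis", "Vision-Disorder"},
}

cycles = find_cycles(IS_A)
cleaned = without_pairs(IS_A, cycles)
```

Here the detected cycles themselves serve as the negative list; in the procedure described in this paper the list is compiled manually after inspecting the cycles.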

Fig. 3. Semantic relations in the UMLS metathesaurus.

the UMLS metathesaurus. The mrrel table contains roughly 7.5 million records and exhibits the semantic links between two concept unique identifiers (CUIs)4, the mrcon table contains the concept names and mrsty keeps the semantic type(s) assigned to each CUI. These tables (cf. Fig. 3 for a fragment), available as ASCII files, were imported into a Microsoft Access relational database and manipulated using SQL embedded in the VBA programming language. For each CUI in the mrrel subset its alphanumeric code was substituted by the English preferred term given in mrcon.

From a total of 85 899 concepts, we extracted 38 059 anatomy and 50 087 pathology concepts from the metathesaurus. A concept was included in this set if it belonged to one of the predefined anatomy5 or pathology6 types given in the UMLS semantic network. 2247 concepts were included in both sets, anatomy and pathology. This finding can easily be justified by the observation that these hybrid concepts do indeed exhibit multiple meanings.7 As we wanted to keep

Table 1
A triplet in extended LOOM format

(deftriplet HEART
  :is-primitive HOLLOW-VISCUS
  :has-part (:p-and
    ANATOMICAL-FEATURE-OF-HEART
    FIBROUS-SKELETON-OF-HEART
    WALL-OF-HEART
    CAVITY-OF-HEART
    CARDIAC-CHAMBER-NOS
    LEFT-CORONARY-SULCUS
    RIGHT-CORONARY-SULCUS
    SURFACE-OF-HEART-NOS
    LEFT-SIDE-OF-HEART
    RIGHT-SIDE-OF-HEART
    AORTIC-VALVE
    TRICUSPID-VALVE
    PULMONARY-VALVE
    MITRAL-VALVE
    HEART-VALVES-100))

4 As a coding convention in UMLS, any two CUIs must be connected by at least a shallow relation (in Fig. 3, CHD (child) relations in the column REL are assumed between CUIs). These shallow relations may be refined in the column RELA, if a thesaurus is available which contains more specific information. Some CUIs are linked either by part-of or is-a. In any case, the source thesaurus for the relations and the CUIs involved is specified in the columns X and Y (e.g. MeSH 1999 (MSH99), SNOMED International 1998 (SNMI98)).

5 Anatomical Structure; Embryonic Structure; Congenital Abnormality; Acquired Abnormality; Fully Formed Anatomical Structure; Body System; Body Part, Organ, or Organ Component; Tissue; Cell; Cell Component; Gene or Genome; Body Location or Region; Body Space or Junction; Anatomical Abnormality.

6 Pathologic Function; Disease or Syndrome; Mental or Behavioral Dysfunction; Cellular or Molecular Dysfunction; Experimental Model of Disease; Neoplastic Process.

7 For instance, Tumor has the meaning of a malignant disease on the one hand, and of an anatomical structure on the other hand. The same applies to congenital and acquired malformations, e.g. Claw Foot.
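The selection of concepts by semantic type and the substitution of CUIs by preferred terms can be sketched as follows. This is a Python illustration, not the SQL/VBA code actually used. The three dictionaries and lists are miniature stand-ins for the mrcon, mrsty and mrrel tables; their layout, the CUI values, the type assignments and the relation rows are invented for the example, and only a few of the semantic types from footnotes 5 and 6 are listed.

```python
# Subsets of the anatomy and pathology semantic types (footnotes 5 and 6).
ANATOMY_TYPES = {
    "Anatomical Structure",
    "Body Part, Organ, or Organ Component",
    "Acquired Abnormality",
}
PATHOLOGY_TYPES = {"Disease or Syndrome", "Neoplastic Process"}

# Hypothetical miniature tables; real UMLS files have a different layout.
mrcon = {"C01": "Heart", "C02": "Mitral Valve", "C03": "Claw Foot", "C04": "Endocarditis"}
mrsty = {
    "C01": {"Body Part, Organ, or Organ Component"},
    "C02": {"Body Part, Organ, or Organ Component"},
    "C03": {"Acquired Abnormality", "Disease or Syndrome"},
    "C04": {"Disease or Syndrome"},
}
mrrel = [("C01", "has_part", "C02"), ("C04", "has_location", "C01")]


def select(sty, wanted_types):
    """CUIs that carry at least one of the wanted semantic types."""
    return {cui for cui, types in sty.items() if types & wanted_types}


def with_preferred_terms(rows, names):
    """Replace the CUIs in each relation row by their preferred terms."""
    return [(names[a], rela, names[b]) for a, rela, b in rows]


anatomy = select(mrsty, ANATOMY_TYPES)
pathology = select(mrsty, PATHOLOGY_TYPES)
hybrids = anatomy & pathology            # concepts typed as both
total = len(anatomy | pathology)         # each hybrid counted once
named_relations = with_preferred_terms(mrrel, mrcon)
```

The same counting applies to the figures reported above: 38 059 anatomy plus 50 087 pathology concepts, minus the 2247 hybrids counted twice, gives the total of 85 899 concepts.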

the two subdomains strictly disjoint, we duplicated these hybrids and prefixed them with 'ana-' or 'pat-' according to their respective subdomain. The a priori assignment to the above-mentioned semantic types in the UMLS is the only selection criterion; we refrained from any manual interference at this processing stage.

Anatomy and pathology concepts received a different formal treatment, however. As target structures for the anatomy domain we chose SEP triplets. These were expressed in the terminological language LOOM, which we had previously extended by a special DEFTRIPLET macro (cf. Table 1 for an example).8 Only part-of, has-part and is-a relation attributes (RELA fields in the mrrel table) from the UMLS were considered for the construction of taxonomic and partonomic hierarchies (cf. Fig. 2). Hence, for each anatomy concept, one SEP triplet is created. The result is a mixed is-a/part-whole hierarchy (cf. Fig. 4).

Fig. 4. A mixed is-a and part-of hierarchy.

For the pathology domain, we assumed the values CHD (child), ISA and RN (narrower relation) from the mrrel REL field as indicators of taxonomic links. For all anatomy concepts referred to in the definitional statements of pathology concepts, the 'S-node' is the default concept to which they are linked, thus enabling the propagation of roles across the part-whole hierarchy (see below).

In both subdomains, shallow relations, such as the extremely frequent SIB (sibling) relation, were included as comments in the code to give some heuristic guidance for the manual refinement phase (cf. Fig. 2).

3.2. Step 2: consistency checking by the description classifier

The import of UMLS anatomy concepts resulted in 38 059 DEFTRIPLET expressions for anatomical concepts and 50 087 DEFCONCEPT expressions for pathological concepts. Each DEFTRIPLET was expanded into three DEFCONCEPT (S-, E-, and P-nodes) and two DEFRELATION (anatomical-part-of-x, inv-anatomical-part-of-x) expressions, summing up to 114 177 concepts and 76 118 relations. Thus we obtained (together with 382 concepts from the semantic network) a total of 240 764 definitional LOOM expressions.

Of the 38 059 anatomy triplets, 1219 DEFTRIPLET statements exhibited a :has-part clause followed by a list of a variable number of triplets, containing more than one argument in 823 cases (average cardinality: 3.3). 4043 DEFTRIPLET statements contained a :part-of clause, only in 332 cases followed by more than one argument (average cardinality:

8 The UMLS anatomy concepts are mapped to an intermediate language, P-LOOM, the reason being that the manual refinement of automatically generated LOOM triplets is time-consuming and too error-prone due to their complex internal structure. P-LOOM provides the full expressiveness of LOOM, enriched by special constructors for the encoding of the part-whole relations, as well as for direct manipulation of the triplet elements, whenever necessary. The main feature of P-LOOM, the macro DEFTRIPLET, shares the syntax of the concept-forming LOOM constructor DEFCONCEPT, augmented by the keywords :part-of and :has-part (both are followed by a list of SEP triplets).
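The expansion of a DEFTRIPLET into three concept and two relation definitions, and the resulting expression counts reported in Section 3.2, can be retraced with a short Python sketch. The naming scheme for the generated nodes and relations is an assumption made for illustration; the actual P-LOOM macro is not reproduced here.

```python
def expand_deftriplet(name):
    """Names of the expressions generated for one triplet (hypothetical naming scheme)."""
    concepts = [f"{name}-S", f"{name}-E", f"{name}-P"]
    relations = [f"anatomical-part-of-{name}", f"inv-anatomical-part-of-{name}"]
    return concepts, relations


def expression_counts(n_triplets, n_pathology, n_semantic_network):
    """Concept, relation and total expression counts after macro expansion."""
    concepts = 3 * n_triplets      # S-, E- and P-node per triplet
    relations = 2 * n_triplets     # role and inverse role per triplet
    total = concepts + relations + n_pathology + n_semantic_network
    return concepts, relations, total


heart_concepts, heart_relations = expand_deftriplet("HEART")
counts = expression_counts(38059, 50087, 382)
```

With the figures given above, `counts` reproduces the 114 177 anatomy concepts, the 76 118 relations and the total of 240 764 expressions.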

1.1). The resulting knowledge base was then submitted to the terminological classifier and automatically checked for terminological cycles and consistency. A terminological cycle occurs when a concept A subsumes a concept B and is, at the same time, subsumed by B. Inconsistencies occur when constraints (e.g. role restrictions) are violated. In the anatomy subdomain, one terminological cycle and 2328 inconsistent concept definitions were identified; in the pathology subdomain 355 terminological cycles were detected, though no inconsistent concept definition at all was found (cf. Table 2).

Table 2
Classification results for anatomy and pathology concepts

                         Anatomy    Pathology
Triplets                 38 059     –
DEFCONCEPT statements    114 177    50 087
Cycles                   1          355
Inconsistencies          2328       0

3.3. Step 3: manual restitution of consistency

The inconsistencies of the anatomy part of the knowledge base identified by the classifier could be traced back to the simultaneous linkage of two triplets by both is-a and part-of links, an encoding that raises a conflict due to the disjointness required for corresponding P- and E-nodes. In most of these cases the affected parents belong to a class of concepts that obviously cannot be appropriately modeled as SEP triplets, e.g. Subdivision-Of-Ascending-Aorta, Organ-Part. The meaning of these concepts almost paraphrases that of a P-node, so that in these cases the violation of the SEP-internal disjointness condition could be accounted for by substituting the involved triplets with simple LOOM concepts, by matching them with already existing P-nodes, by relaxing the disjointness constraint, or by disabling is-a or part-of links.

In the pathology part of the knowledge base, we expected a large number of terminological cycles to arise as a consequence of interpreting the notoriously weak, thesaurus-style RN (narrower) and CHD (child) relations through taxonomic subsumption (is-a). Bearing in mind the size of the knowledge base, we consider 355 cycles a tolerable amount of noise. Those cycles were primarily due to very similar concepts, e.g. Arteriosclerosis vs. Atherosclerosis, Amaurosis vs. Blindness, and residual categories ('other', 'NOS' = not otherwise specified). These were directly inherited from the source terminologies and are always difficult to interpret out of their definitional context, e.g. Other-Malignant-Neoplasm-of-Skin vs. Malignant-Neoplasm-of-Skin-NOS. The cycles were analyzed and a negative list consisting of 630 concept pairs was manually derived. In a subsequent extraction cycle, we incorporated this list in the automated construction of the LOOM concept definitions. By adding these new constraints a fully consistent knowledge base was generated.

3.4. Step 4: manual rectification and refinement of the knowledge base

Adding value to a consistent though possibly underspecified or even misspecified knowledge base is an extremely time-consuming job and requires broad and in-depth medical expertise. In order to roughly assess the potential workload for future knowledge base finishing, we extracted two random samples (n=100 each), one from the anatomy and one from the pathology part of the knowledge base; the samples were then analyzed by a medical student and a physician. From the experience we gained in both subdomains so far, the following workflow can be derived:

3.4.1. Checking the correctness and completeness of both the taxonomic and partitive hierarchies

Taxonomic and partitive links are manually added or removed in order to eliminate inadequate concept descriptions, to increase the completeness and to deepen the granularity of concept descriptions. Primitive subsumption (where only necessary conditions for a specialization relation between concepts are specified) is substituted by a nonprimitive one (where necessary and sufficient conditions for a specialization relation between concepts are specified) whenever possible. This is a crucial point, because the automatically generated hierarchies contain only information about the parent concepts and necessary conditions. As an example, the automatically generated definition of Dermatitis includes the information that it is an Inflammation and that the role has-location must be filled by the concept Skin. An Inflammation that has-location Skin, however, cannot be classified automatically as Dermatitis.

3.4.2. Results

In the anatomy sample, only 76 concepts out of 100 could be unequivocally classified as belonging to 'canonical' anatomy. (The remainder, e.g. ana-Phalanx-of-Supernumerary-Digit-of-Hand, referring to pathological anatomy, was immediately excluded from analysis.) Besides the assignment to the UMLS semantic types, only 27 (direct) taxonomic links were found. Another 83 UMLS relations (mostly CHD (child) or RN (narrower) relations) were manually upgraded to taxonomic links. 12 (direct) part-of and 19 has-part relations were found. Four part-of relations and one has-part relation had to be removed, since we considered them implausible. 51 UMLS relations (mostly CHD (child) or RN (narrower) relations) were manually upgraded to part-of relations, and 94 UMLS relations (mostly PAR and RB, i.e. parent and broader relations) were upgraded to has-part relations. After this workup and upgrade of shallow UMLS relations to semantically more specific relations, the sample was checked for completeness again. As a result, 14 is-a and 37 part-of relations were still considered missing.

In the pathology sample, the assignment to the pathology subdomain was considered plausible for 99 of 100 concepts. A total of 15 false is-a relations were identified in 12 concept definitions, while 24 is-a relations were considered to be missing.

3.4.3. Checking :has-part arguments assuming 'real anatomy'

In the UMLS sources part-of and has-part are considered symmetric. According to our transformation rules, the attachment of a role has-anatomical-part to an E-node BE, with its range restricted to AE, implies the existence of a concept AE for the definition of concept BE. On the other hand, the classification of AE as being subsumed by the P-node BP, the latter being defined via the role anatomical-part-of restricted to BE, implies the existence of BE given the existence of AE (cf. Fig. 5, left). This assumption does not always match 'real' anatomy, i.e. anatomical concepts that may exhibit pathological modifications. Fig. 5 (left part) sketches a concept AE that is necessarily anatomical-part-of a concept BE, but whose existence is not required for the definition of BE. This is typical of the results of surgical interventions, e.g. a large intestine without an appendix, or an oral cavity without teeth, etc.

3.4.4. Results

All 112 has-part relations obtained by the automatic import and the manual workup of our sample were checked. The analysis revealed that more than half of them (62) should be eliminated in order not to obviate

Fig. 5. Patterns for partonomic reasoning using SEP triplets: anatomical-part-of without has-anatomical-part (left),
has-anatomical-part without anatomical-part-of (right).

a coherent classification of pathologically modified anatomical objects.9 As an example, most instances of Ileum do not contain a Meckel's Diverticulum, whereas all instances of Meckel's Diverticulum are necessarily anatomical-part-of Ileum. Many surgical interventions that remove anatomical structures (appendix, gallbladder, etc.) produce similar patterns. In our formalism, this corresponds to a single taxonomic link between a P-node and an S-node (cf. Fig. 5, left part). The non-linkage situation is also possible (cf. Fig. 5, right part). The definition of AE does not imply the role anatomical-part-of to be filled by BE, but BE does imply that the inverse role be filled by AE. As an example, a Lymph-Node necessarily contains Lymph-Follicles, but there exist Lymph-Follicles that are not part of a Lymph-Node. This pattern is characteristic of mereological relations between macroscopic (countable) objects, such as organs, and multiple uniform microscopic objects [34].

3.4.5. Analysis of the sibling relations and defining concepts as being disjoint

In the UMLS mrrel table, the SIB(ling) relation links concepts which share the same parent in a taxonomic or partonomic hierarchy. Pairs of sibling concepts may either have common descendants or not. If not, they constitute the roots of two disjoint subtrees. In a taxonomic hierarchy, this means that one concept implies the negation of the other (e.g. a benign tumor cannot be a malignant one, and vice versa). In a partitive hierarchy, this corresponds to two topologically disconnected objects, CE and DE, with the following interpretation: there are no common parts shared by any instance of CE with any instance of DE. In our triplet formalism this can be expressed as follows (for a formalization and further discussion of topological aspects in anatomical ontologies, cf. [39,40]): topological disconnectedness refers to a pair of concepts, CE and DE, whose S-nodes, CS and DS, belong to two disjoint subgraphs (i.e. CS implies the negation of DS). As a consequence there is no instance that is both anatomical-part-of an instance of CE and anatomical-part-of an instance of DE (cf. Fig. 6). As an example, the concepts Right Hand and Left Hand are topologically disconnected, whereas Right Hand and Right Forearm are not (there are instances which share a common boundary structure).

9 In Table 1, the concepts marked by italics, viz. Aortic-valve, Tricuspid-valve, Pulmonary-valve and Mitral-valve, should all be eliminated from the :has-part list, because they may be missing in certain cases as a result of congenital malformations, inflammatory processes or surgical interventions.
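The criterion for topological disconnectedness from Section 3.4.5 can be approximated by a small Python check over a part-of graph. This is a concept-level simplification of the instance-level definition, not the triplet encoding itself; the concept names other than Right Hand, Left Hand and Right Forearm (in particular Right-Wrist-Boundary) and the function names are hypothetical.

```python
def wholes(part_of, concept):
    """The concept itself plus everything it is transitively part of."""
    found, stack = {concept}, [concept]
    while stack:
        for whole in part_of.get(stack.pop(), ()):
            if whole not in found:
                found.add(whole)
                stack.append(whole)
    return found


def common_parts(part_of, c, d):
    """Concepts that are (reflexively, transitively) part of both c and d."""
    nodes = set(part_of) | {w for ws in part_of.values() for w in ws}
    return {n for n in nodes if {c, d} <= wholes(part_of, n)}


def topologically_disconnected(part_of, c, d):
    """True if no concept is part of both c and d."""
    return not common_parts(part_of, c, d)


# part -> direct wholes; an invented fragment for illustration
PART_OF = {
    "Right-Thumb": {"Right-Hand"},
    "Left-Thumb": {"Left-Hand"},
    "Right-Wrist-Boundary": {"Right-Hand", "Right-Forearm"},
}
```

A sibling pair for which `topologically_disconnected` holds is a candidate for declaring the two S-nodes disjoint; a pair with a shared part, such as the boundary structure in this fragment, is not.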

3.4.6. Results

We found, on average, 6.8 siblings per concept in the anatomy domain, and 8.8 in the pathology domain. So far, the analysis of sibling relations has been performed only for the anatomy domain. From a total of 521 sibling relations, 9 were identified as is-a, 14 as part-of, and 17 as has-part, whereas 404 referred to topologically disconnected concepts.

3.4.7. Completion and modification of anatomy–pathology relations

For each pathology concept (as determined by the LOOM system after classification) it has to be checked whether the anatomy–pathology links are correct and complete. Incorrect constraints have to be removed from a concept definition itself or from one of the subsuming concepts. For each correct anatomy–pathology relation a decision must be taken whether the E-node or the S-node has to be addressed as the target concept for modification. In the first case, the propagation of roles across part-whole hierarchies is disabled. As an example (cf. Fig. 7), Enteritis implies has-location Intestine. The range of the relation has-location is restricted to the E-node of Intestine, IE. This precludes, for instance, the computation of an is-a relation between Appendicitis and Enteritis, though Appendix is related to Intestine via an anatomical-part-of relation. In the second case, the target is the S-node of the anatomical triplet, and, thus, the propagation of roles is enabled. Glomerulonephritis (has-location Glomerulum) is therefore subsumed by Nephritis (has-location Kidney), since Glomerulum is defined as an anatomical-part-of Kidney. In the same way, Perforation-of-Appendix is generalized as Intestinal-Perforation (cf. [30,31] for a comprehensive analysis and formal specification of these phenomena).

3.4.8. Results

In our random sample we found 522 anatomy–pathology relations, of which 358 (i.e. 69%) were judged as incorrect by the domain experts. In 36 cases an adequate anatomy–pathology relation was missing. All 164 remaining has-location roles were analyzed as to whether they were to be filled by an S-node or an E-node of an anatomical triplet. In 153 cases, the S-node (which allows propagation across the part-whole hierarchy) was considered to be adequate; in 11 cases the E-node was preferred. The analysis of the random sample of 100 pathology concepts revealed that only 17 of them were to be linked with an anatomy concept. In 15 cases, the default linkage to the S-node was considered to be correct, in one case the linkage to the E-node was preferred, and in another case a given linkage was considered to be false.

The high number of implausible constraints points to the lightweight semantics of has-location links in the UMLS sources. While we interpreted them in terms of a conjunction for the import routine, a disjunctive meaning seems to prevail implicitly in many definitions of top-level concepts such as Tuberculosis. In this example, we find all anatomical concepts that can be affected by this disease, linked by has-location. All these

Fig. 6. Triplet representation for topologically disconnected concepts.

Fig. 7. Alternative linkages of a pathology concept: either to the S-node or to the E-node of an anatomical triplet in
order to enable or preclude computation of is-a relations, respectively.
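
The two alternative linkages of Fig. 7 can be illustrated with a small Python sketch. It is a toy reconstruction and not the LOOM definitions used in this work. The node naming (X_S, X_E, X_P), the encoding of part-of as an is-a link from the S-node of the part to the P-node of the whole (a simplified reading of the SEP triplet scheme of [30,31]), and all function names are assumptions made for the example. Disease concepts are reduced to the filler of their has-location role.

```python
# Toy is-a graph over SEP triplet nodes (assumed encoding, see text above).
IS_A = {}


def add_triplet(name):
    """Create the S-, E- and P-node of one anatomical concept."""
    s, e, p = f"{name}_S", f"{name}_E", f"{name}_P"
    IS_A.setdefault(s, set())
    IS_A[e] = {s}  # the entity itself is subsumed by its S-node
    IS_A[p] = {s}  # every part of the entity is subsumed by its S-node


def add_part_of(part, whole):
    """Assumption: part-of is emulated by 'part_S is-a whole_P'."""
    IS_A[f"{part}_S"].add(f"{whole}_P")


def subsumed_by(node, ancestor):
    """Reflexive, transitive closure of the is-a links."""
    if node == ancestor:
        return True
    return any(subsumed_by(parent, ancestor) for parent in IS_A.get(node, ()))


def pathology_is_a(specific_location, general_location):
    """A disease located at specific_location specializes a disease located
    at general_location iff the first filler is subsumed by the second."""
    return subsumed_by(specific_location, general_location)


for concept in ("Intestine", "Appendix", "Kidney", "Glomerulum"):
    add_triplet(concept)
add_part_of("Appendix", "Intestine")
add_part_of("Glomerulum", "Kidney")

# General concept linked to the E-node: no propagation across part-of.
appendicitis_is_enteritis = pathology_is_a("Appendix_S", "Intestine_E")

# General concept linked to the S-node: propagation across part-of.
glomerulonephritis_is_nephritis = pathology_is_a("Glomerulum_S", "Kidney_S")
perforation_generalizes = pathology_is_a("Appendix_S", "Intestine_S")
```

Under these assumptions, linking the general concept to the S-node lets the disease of a part be classified under the disease of the whole, whereas linking it to the E-node keeps the two concepts taxonomically unrelated.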

constraints (e.g. has-location Urinary-Tract) are inherited to subconcepts such as Tuberculosis-of-Bronchus. Hence, a thorough analysis of the top-level pathology concepts is necessary, and conjunctions of constraints will have to be substituted by disjunctions where necessary.

4. Conclusions

There is a growing demand for high-quality terminology services and their embedding in functionally advanced health information systems. Among the desiderata that have to be fulfilled is the need to make consistent, conceptually rich knowledge bases available so that their inference engines can derive valid results. While there is a long tradition of developing medical knowledge bases from scratch, we here propose a conservative approach: reuse existing large-scale repositories, but refine the data from these resources so that advanced requirements imposed by more expressive knowledge representation languages are met. Consistency checking comes almost for free, once the informal knowledge sources are embedded in a formal reasoning framework (cf., e.g. the work of Mejino and Rosse who recognized inconsistencies in the UMLS based on formal representation structures in the Digital Anatomist model [41]). The resulting knowledge bases can then be used for sophisticated applications requiring sound medical reasoning.
The knowledge engineering approach we have proposed in this paper does exactly this. It provides a formally solid description logics framework with a modeling extension which supports not only taxonomic reasoning, but also incorporates partonomic reasoning adapted to the requirements of anatomy as the foundation of medical terminology. In spite of their evident weaknesses, the subsets of the UMLS we analyzed proved to be useful as a source of terminological knowledge on a large scale. Whereas the restitution of logical consistency could be achieved in a straightforward way, the cleansing of the resulting knowledge base from inadequate concept definitions and specification gaps implies a high degree of manual involvement, which requires enormous efforts when it has to be performed on the knowledge base as a whole. A realistic setting would be to eliminate inadequacies once and for all, but to remedy specification gaps only when required by concrete applications.
For anatomy and pathology, the domains under analysis, this study sheds light on the conditioned usability of the conceptual 'raw material' the UMLS metathesaurus provides for knowledge engineering. For macroscopic anatomy the existing resources proved fruitful due to the inclusion of the UWDA (University of Washington Digital Anatomist, cf. [42]) knowledge base which delivers semantically precise relationships. Severe weaknesses and underspecification arise in the pathology portion where the necessary linkage to anatomy proved to be entirely insufficient.
While plain automatic conversion from semi-formal to formal environments causes problems of adequacy of the emerging representation structures, the step-wise refinement methodology we propose already inherits its power from the terminological reasoning framework. In our concrete work, we found the implications of using the terminological classifier, the inference engine which computes subsumption relations, of utmost importance and of outstanding heuristic value. Hence, the knowledge refinement cycles are truly semi-automatic, fed by medical expertise on the side of the human knowledge engineer, but also driven by the reasoning system which makes explicit the consequences of (im)proper concept definitions.

Acknowledgements

Stefan Schulz was supported by a grant from DFG (Ha 2097/5-2).

References

[1] A. Rossi-Mori, F. Consorti, E. Galeazzi, Standards to support development of terminological systems for health care, Methods Inf. Med. 37 (1998) 551–563.
[2] J. Ingenerf, W. Giere, Concept-oriented standardization and statistics-oriented classification: Continuing the classification versus nomenclature terminology, Methods Inf. Med. 37 (1998) 527–539.
[3] WHO, International Statistical Classification of Diseases and Health Related Problems, 10th Revision, World Health Organization, Geneva, 1992.
[4] R. Côté, SNOMED International, College of American Pathologists, Northfield, IL, 1993.
[5] NHS, NHS Clinical Terms, Version 3.1, National Health Service Information Authority, Leicestershire, 1999.
[6] NLM, Medical Subject Headings, National Library of Medicine, Bethesda, MD, 1997.
[7] D.A. Evans, J.J. Cimino, W.R. Hersh, S.M. Huff, D.S. Bell, Toward a medical-concept representation language, J. Am. Med. Inf. Assoc. 1 (3) (1994) 207–217.

[8] A.L. Rector, W.D. Solomon, W.A. Nowlan, T. Rush, A terminology server for medical language and medical information systems, Methods Inf. Med. 34 (2) (1995) 147–157.
[9] A. Burgun, P. Denier, O. Bodenreider, G. Botti, D. Delamarre, B. Pouliquen, P. Oberlin, J.M. Leveque, B. Lukacs, F. Kohler, M. Fieschi, P. Le Beux, A Web terminology server using UMLS for the description of medical procedures, J. Am. Med. Inf. Assoc. 4 (5) (1997) 356–363.
[10] D.M. Pisanelli, A. Gangemi, G. Steve, An ontological analysis of the UMLS metathesaurus, in: C.G. Chute (Ed.), AMIA'98, Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley and Belfus, Philadelphia, PA, 1998, pp. 810–814.
[11] C.G. Chute, P.L. Elkin, D.D. Sheretz, M.S. Tuttle, Desiderata for a clinical terminology server, in: N.M. Lorenzi (Ed.), AMIA'99, Proceedings of the 1999 Annual Symposium of the American Medical Informatics Association. Transforming Health Care through Informatics: Cornerstones for a New Information Management Paradigm, Washington, D.C., November 6–10, 1999, Hanley and Belfus, Philadelphia, PA, 1999, pp. 42–46.
[12] NLM, Unified Medical Language System, National Library of Medicine, Bethesda, MD, 2001.
[13] J.J. Cimino, Auditing the Unified Medical Language System with semantic methods, J. Am. Med. Inf. Assoc. 5 (1) (1998) 41–45.
[14] O. Bodenreider, A. Burgun, G. Botti, M. Fieschi, P. Le Beux, Evaluation of the United [sic!] Medical Language System as medical knowledge source, J. Am. Med. Inf. Assoc. 5 (1) (1998) 76–87.
[15] U. Hahn, M. Romacker, S. Schulz, How knowledge drives understanding: Matching medical ontologies with the needs of medical language processing, Artif. Intell. Med. 15 (1) (1999) 25–51.
[16] J.A. Reggia, S. Tuhrim (Eds.), Computer-Assisted Medical Decision Making, Springer, New York, 1985.
[17] F. Volot, P. Zweigenbaum, B. Bachimont, M.B. Said, J. Bouaud, M. Fieschi, J.-F. Boisvieux, Structuration and acquisition of medical knowledge: Using UMLS in the Conceptual Graph formalism, in: C. Safran (Ed.), SCAMC'93, Proceedings of the 17th Annual Symposium on Computer Applications in Medical Care, Washington, D.C., October 30–November 3, 1993, McGraw-Hill, New York, 1994, pp. 710–714.
[18] E. Mays, R. Weida, R. Dionne, M. Laker, B. White, C. Liang, F.J. Oles, Scalable and expressive medical terminologies, in: J.J. Cimino (Ed.), Proceedings of the 1996 AMIA Annual Fall Symposium (formerly SCAMC). Beyond the Superhighway: Exploiting the Internet with Medical Informatics, Washington, D.C., October 26–30, 1996, Hanley and Belfus, Philadelphia, PA, 1996, pp. 259–263.
[19] A.L. Rector, S. Bechhofer, C.A. Goble, I. Horrocks, W.A. Nowlan, W.D. Solomon, The GRAIL concept modelling language for medical terminology, Artif. Intell. Med. 9 (1997) 139–171.
[20] W.A. Woods, J.G. Schmolze, The KL-ONE family, Comp. Math. Appl. 23 (2/5) (1992) 133–177.
[21] K.A. Spackman, K.E. Campbell, Compositional concept representation using SNOMED: Towards further convergence of clinical terminologies, in: C.G. Chute (Ed.), AMIA'98, Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley and Belfus, Philadelphia, PA, 1998, pp. 740–744.
[22] I.J. Haimowitz, R.S. Patil, P. Szolovits, Representing medical knowledge in a terminological language is difficult, in: R.A. Greenes (Ed.), SCAMC'88, Proceedings of the 12th Annual Symposium on Computer Applications in Medical Care, Washington, D.C., IEEE Computer Society Press, 1988, pp. 101–105.
[23] R. Schubert, K.-H. Höhne, Partonomies for interactive explorable 3D-models of anatomy, in: C.G. Chute (Ed.), AMIA'98, Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley and Belfus, Philadelphia, PA, 1998, pp. 433–437.
[24] C. Rosse, L.G. Shapiro, J.F. Brinkley, The Digital Anatomist foundational model: Principles for defining and structuring its concept domain, in: C.G. Chute (Ed.), AMIA'98, Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley and Belfus, Philadelphia, PA, 1998, pp. 820–824.
[25] A. Artale, E. Franconi, N. Guarino, L. Pazzi, Part-whole relations in object-centered systems: An overview, Data Knowledge Eng. 20 (3) (1996) 347–383.

[26] I. Horrocks, U. Sattler, A description logic with transitive and inverse roles and role hierarchies, J. Logic Comp. 9 (3) (1999) 385–410.
[27] J.G. Schmolze, W.S. Mark, The NIKL experience, Comp. Intell. 6 (1) (1991) 48–69.
[28] E.B. Schulz, C. Price, P.J.B. Brown, Symbolic anatomic knowledge representation in the Read Codes Version 3: Structure and application, J. Am. Med. Inf. Assoc. 4 (1) (1997) 38–48.
[29] M. Schmidt-Schauß, G. Smolka, Attributive concept descriptions with complements, Artif. Intell. 48 (1) (1991) 1–26.
[30] S. Schulz, M. Romacker, U. Hahn, Part-whole reasoning in medical ontologies revisited: Introducing SEP triplets into classification-based description logics, in: C.G. Chute (Ed.), AMIA'98, Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley and Belfus, Philadelphia, PA, 1998, pp. 830–834.
[31] U. Hahn, S. Schulz, M. Romacker, Part-whole reasoning: A case study in medical ontology engineering, IEEE Intell. Syst. Their Applic. 14 (5) (1999) 59–67.
[32] D.A. Cruse, On the transitivity of the part-whole relation, J. Linguistics 15 (1979) 29–38.
[33] M. Winston, R. Chaffin, D.J. Herrmann, A taxonomy of part-whole relationships, Cognitive Sci. 11 (1987) 417–444.
[34] S. Schulz, Bidirectional mereological reasoning in anatomical knowledge bases, in: AMIA, Proceedings of the 2001 Annual Symposium of the American Medical Informatics Association. Washington, D.C., November 3–7, 2001.
[35] R. MacGregor, R. Bates, The LOOM knowledge representation language, Technical Report RS-87-188, Information Sciences Institute, University of Southern California, 1987.
[36] R. MacGregor, A description classifier for the predicate calculus, in: AAAI'94, Proceedings of the 12th National Conference on Artificial Intelligence, vol. 1, Seattle, WA, July 31–August 4, 1994, AAAI Press and MIT Press, Menlo Park, CA, 1994, pp. 213–220.
[37] J. Heinsohn, D. Kudenko, B. Nebel, H.-J. Profitlich, An empirical analysis of terminological representation systems, Artif. Intell. 68 (2) (1994) 367–397.
[38] G. Carenini, J.D. Moore, Using the UMLS semantic network as a basis for constructing a terminological knowledge base: A preliminary report, in: C. Safran (Ed.), SCAMC'93, Proceedings of the 17th Annual Symposium on Computer Applications in Medical Care, Washington, D.C., October 30–November 3, 1993, McGraw-Hill, New York, 1994, pp. 725–729.
[39] S. Schulz, U. Hahn, M. Romacker, Modeling anatomical spatial relations with description logics, in: J.M. Overhage (Ed.), AMIA 2000, Proceedings of the Annual Symposium of the American Medical Informatics Association. Converging Information, Technology, and Health Care, Los Angeles, CA, November 4–8, 2000, Hanley and Belfus, Philadelphia, PA, 2000, pp. 779–783.
[40] S. Schulz, U. Hahn, Parts, locations, and holes: Formal reasoning about anatomical structures, in: S. Quaglini, P. Barahona, S. Andreassen (Eds.), Artificial Intelligence in Medicine. Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, AIME 2001, volume 2101 of Lecture Notes in Artif. Intell., Cascais, Portugal, July 1–4, 2001, Springer, Berlin, 2001, pp. 293–303.
[41] J.L.V. Mejino, Jr., C. Rosse, The potential of the Digital Anatomist foundational model for assuring consistency in UMLS sources, in: C.G. Chute (Ed.), AMIA'98, Proceedings of the 1998 AMIA Annual Fall Symposium. A Paradigm Shift in Health Care Information Systems: Clinical Infrastructures for the 21st Century, Orlando, FL, November 7–11, 1998, Hanley and Belfus, Philadelphia, PA, 1998, pp. 825–829.
[42] C. Rosse, J. Leonardo, V. Mejino, B.R. Modayur, R. Jakobovits, K.P. Hinshaw, J.F. Brinkley, Motivation and organizational principles for anatomical knowledge representation: The Digital Anatomist symbolic knowledge base, J. Am. Med. Inf. Assoc. 5 (1) (1998) 17–40.
