Mining Association Rules From Semantic Web Data: June 2010

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/221049296

Mining Association Rules from Semantic Web Data

Conference Paper · June 2010


DOI: 10.1007/978-3-642-13025-0_52 · Source: DBLP

CITATIONS READS

16 315

2 authors:

Victoria Nebot Rafael Berlanga-Llavori


Universitat Jaume I Universitat Jaume I
34 PUBLICATIONS   432 CITATIONS    145 PUBLICATIONS   1,426 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

SLOD-BI: Development of an open linked data infrastructure for Social Business Intelligence aimed at SMEs View project

All content following this page was uploaded by Victoria Nebot on 09 May 2014.

The user has requested enhancement of the downloaded file.


Knowledge-Based Systems 25 (2012) 51–62

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Finding association rules in semantic web data


Victoria Nebot ⇑, Rafael Berlanga
Departamento de Lenguajes y Sistemas Informáticos, Universitat Jaume I, Campus de Riu Sec, 12071 Castellón, Spain

a r t i c l e i n f o a b s t r a c t

Article history: The amount of ontologies and semantic annotations available on the Web is constantly growing. This new
Available online 26 May 2011 type of complex and heterogeneous graph-structured data raises new challenges for the data mining
community. In this paper, we present a novel method for mining association rules from semantic instance
Keywords: data repositories expressed in RDF/(S) and OWL. We take advantage of the schema-level (i.e. Tbox) knowl-
Semantic web edge encoded in the ontology to derive appropriate transactions which will later feed traditional associ-
Data mining ation rules algorithms. This process is guided by the analyst requirements, expressed in the form of query
Association rules
patterns. Initial experiments performed on semantic data of a biomedical application show the usefulness
Semantic annotation
Biomedical application
and efficiency of the approach.
 2011 Elsevier B.V. All rights reserved.

1. Introduction of semantic data is quite different from that of traditional data sets
treated by these algorithms. Thus, the main challenges we face in
In the past few years, there has been an increasing interest in this work are the following ones:
combining the two research areas semantic web (SW) and data
mining (DM) [1,33]. Thanks to the standardization of the ontology  Traditional DM algorithms deal with homogeneous data sets
languages RDF/(S)1 and OWL,2 the SW has been realized and the composed by transactions, where each transaction is repre-
amount of available semantic annotations is ever increasing. This sented by a subset of items. In contrast, in a repository of
is due in part to the active research concerned about learning knowl- semantic annotations expressed in OWL we keep ontology axi-
edge structures from textual data, usually referred as Ontology oms, describing the conceptual domain, and the semantic anno-
Learning [9]. On the other hand, background knowledge has been tations are represented as assertions relating instances through
also used to improve the results of Web mining. However, little work properties that are consistent with the ontology. The usual way
has been directed towards mining SW data itself, that is, mining the to represent these assertions is as triples (subject, predi-
SW. We strongly believe that mining SW data will bring much ben- cate, object). In this scenario, the identification of transactions
efit to many domain-specific research communities where relevant and items is not trivial. Items may correspond to either
data are often complex and heterogeneous, and a large body of instances or literals, and a transaction is defined according to
knowledge is available in the form of ontologies and semantic anno- the user requirements as a subset of items semantically related
tations. This is the case of the clinical and biomedical scenarios, in the repository.
where applications often have to deal with large volumes of complex  OWL is sustained by description logics (DLs) [6], which are
data sets with different structure and semantics. In this paper, we knowledge representation formalisms with well-understood
investigate how ontological instances expressed in OWL can be com- formal properties and semantics. Therefore, annotated data
bined into transactions in order to be processed by traditional asso- does not follow a rigid structure. That is, instances belonging
ciation rules algorithms [2], and how we can exploit the rich to the same OWL class may have different structures, giving
knowledge encoded in the respective ontologies to reduce the search place to structural heterogeneity issues.
space.  Since DLs are defined with formal semantics, reasoning capabil-
Machine learning algorithms have been successfully applied to ities must be applied in order to handle the implicit knowledge.
large data sets to extract useful knowledge by searching for
interesting patterns (e.g., association rules). However, the nature As far as we know, the presented approach is the first attempt
to find association rules directly from SW data. Previous work on
⇑ Corresponding author. Tel.: +34 964 72 83 67; fax: +34 964 72 84 35. mining SW data has been mainly focused on clustering and classi-
E-mail addresses: [email protected] (V. Nebot), [email protected] (R. Berlanga).
fying ontology instances [8,12]. However, the representation tech-
1
RDF/(S): http://www.w3.org/TR/rdf-concepts/,rdf-schema/, ’04. niques required in association mining are quite different from
2
OWL: http://www.w3.org/TR/owl-features/, ’04. clustering and classification tasks. Association rules are based on

0950-7051/$ - see front matter  2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2011.05.009
52 V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62

the notion of transaction, which is an observation of the co- instance data. In [8], it is shown how kernel-based machinery
occurrence of a set of items. This is basically a set-based repre- can encapsulate the ontology knowledge into the vector-based
sentation of the world which contrasts with the numerical representation suited for support vector machines. For clustering
vector-based representations used in clustering and classification. purposes, a series of similarity (dissimilarity) measures are pro-
When dealing with SW data, the main challenge consists in identi- posed to mine either semi-structured data [11,14] or ontological
fying interesting transactions and items from the semi-structured instances [12]. In the latter case, knowledge derived from the
and heterogeneous nature of these data. In this way, it becomes ontology is used to define the weights of the similarity measures.
crucial to use as much as possible the knowledge provided by However, as previously mentioned, these representations are not
the ontologies so that transactions can be easily defined and suited for association mining, which requires to define both the
mined. As we will show along this paper, there is a great variety set of items and the transactions from the semantic data.
of ways to generate items and transactions from semantic data, Another related topic to this work is that of mining tree and
which depend on the detail level and structural semantics the graph structured data. In this line, we can find frequent subtree
analyst wants to consider. [10] and graph mining [20], whose aim is to identify frequent sub-
The main contribution of this paper is twofold: (1) we define structures in complex data sets. Albeit interesting, these algo-
the problem of transaction generation and mining association rules rithms do not serve the purpose of finding interesting content
from SW data, and (2) we propose a method to efficiently extract associations in RDF/(S) and OWL graphs because they are con-
items and transactions suited for traditional association rules min- cerned with frequent syntactic substructures but not frequent
ing algorithms. This work extends that presented in [28] in the fol- semantically related contents. Indeed, frequent graph substruc-
lowing aspects: a more elaborate and complete formalism is tures usually hide interesting associations that involve contents
presented, we have updated the related work, and an exhaustive represented with different detail levels of the ontology. Moreover,
evaluation over a real scenario has been carried out. although the underlying structure of RDFS and OWL is a graph, rea-
The rest of the paper is organized as follows. Section 2 gives an soning capabilities must be applied to handle implicit knowledge.
overview of the related work. Section 3 explains the basics of the The way we define transactions is similar to those proposed for
two integrated technologies, association rules mining and OWL analysing highly heterogeneous XML data sets. More specifically,
DL ontologies and motivates the problem with a running example. in [22,38] it is shown that XQuery is not the most suitable query
Section 4 contains the general methodology and foundations of the language for data extraction from heterogeneous XML data
approach. Section 5 shows the experimental evaluation and Sec- sources, since the user must be aware of the structure of the under-
tion 6 gives some conclusions and future work. lying documents. The lowest common ancestor (LCA) semantics
can be applied to extract meaningful related data in a more flexible
way. [22,38] apply some restrictions over the LCA semantics. In
2. Related work particular they propose SLCA [38] and MLCA [22] whose general
intuition is that the LCA must be minimal. However, in [27,30] they
Most research on DM for semantic data is based on inductive showed that these approaches still produced undesired combina-
logic programing (ILP) [26], which exploits the underlying logic tions between data items in some cases (e.g. when a data item
encoded in the data to learn new concepts. Some examples are needs to be combined with a data item at a lower level of the doc-
presented in [23,17]. However, there is the inconvenient of rewrit- ument hierarchy). In order to alleviate the previous limitations
ing the data sets into logic programming formalisms and most of they propose the SPC (smallest possible context) data strategy,
these approaches are not able to identify some hidden concepts which relies on the notion of closeness of data item occurrences
that statistical algorithms would. in an XML document. A similar strategy is presented in [34].
Generalized association rules [32] were formerly proposed to Finally, we find some work aimed at integrating knowledge dis-
consider item taxonomies for mining association rules. The main covery capabilities into SPARQL3 by extending its grammar. Some
idea behind this method is to extend itemsets with all the ances- examples are [18], which can be plugged with several DM algorithms
tors of each item, then computing the frequent itemsets, and final- and [19], which finds complex path relations between resources. In-
ly filtering out those rules containing an item and some of its spired by these works, we have also extended SPARQL grammar to
ancestors. The resulting rules are expressed at different taxonomy define association rule patterns over the ontological data but in a
detail levels, avoiding redundant rules present in the taxonomy. less restrictive way than the one imposed by SPARQL. These patterns
Clearly, this method overheads considerably the transactions allow the system to focus only on the interesting features, reducing
length and therefore the mining algorithm. Additionally, the num- both the number and length of generated transactions.
ber of generated rules is much larger than disregarding the taxon-
omy. Several optimizations have been proposed to alleviate this
3. Preliminaries
overhead. [15] presents two association rules mining algorithms
as the core of the Web personalization process. Both algorithms
This section gives some background about the two integrated
are efficient in the sense that they avoid the costly generation of
research areas, the semantic web and association rules, and intro-
candidate sets and the over-generalization of rules. A pattern-
duces our mining problem statement in this context.
fragment growth method is adopted along with efficient pruning.
Moreover, the special features of Web 2.0 applications are also
addressed, where the taxonomies are not predefined but usually 3.1. Semantic web data
described by user-defined tags. In [36] the problem of efficiently
updating the discovered generalized association rules when the Semantic web technologies are aimed at providing the neces-
item taxonomies are modified is addressed. Nevertheless, our work sary representation languages and tools to bring semantics to the
is not concerned with this kind of association rules but with the current web contents. As a result, the W3C consortium has pro-
definition of transactions for SW data. Once the transactions are posed several representation formats, all relying on XML. The re-
generated we can mine any kind of association rules, like general- source description language (RDF) was the first language
ized and quantitative ones. proposed by the W3C to describe semantic meta-data. In RDF there
Other studies extend statistical machine learning algorithms to
be able to directly deal with ontologies and their associated 3
SPARQL: http://www.w3.org/TR/rdf-sparql-query,’08.
V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62 53

are three kinds of elements: resources, literals, and properties. (called oneOf in OWL) and existential ($), universal (") and
Resources are web objects (entities) that are identified through a cardinality (P, 6, =) restrictions involving an atomic role R or its
URI, literals are atomic values such as strings, dates, numbers, inverse R. In OWL DL it is possible to assert that a concept C is
etc., and properties are binary relationships between resources subsumed by D (C v D), or is equivalent to D (C  D). Equivalence
and literals. Properties are also identified through URIs. The basic and subsumption can be also asserted between roles and roles
building block of RDF is the triple: a binary relationship between can have special constraints (e.g., transitivity, simmetry, function-
two resources or between a resource and a literal. The resulting ality, etc.) Regarding instance axioms, we can specify the class C of
metadata can be seen as a graph where nodes are resources and an instance a (C(a)), or the relations between two instances a and b
literals, and edges are properties connecting them. RDFS extends (R(a, b)). A DL ontology consists of a set of axioms describing the
RDF by allowing triples to be defined over classes and properties. knowledge of an application domain. This knowledge ranges over
In this way, we can describe the schema that rules our metadata the terminological cognition of the domain (the concepts of inter-
within the same description framework. Later on, the ontology est, its Tbox) and its assertions (the instances of the concepts, its
web language (OWL) was proposed to provide well-founded logic Abox). Fig. 2 shows an excerpt of the Tbox for the running example
inferences over semantic descriptions. Most sublanguages of OWL and Table 1 shows a fragment of the Abox.
are supported by decidable subsets of the first order logic, which In this paper, we separate the Tbox from the Abox for practical
are known as description logics (DLs) [6]. Additionally, OWL syntax issues. In general, we use the term ontology (O) to refer to the Tbox
relies on RDF so that technology developed for RDF like triple stores and instance store (IS) to refer to the Abox. The ontology is quite sta-
and query languages can be directly applied to OWL. In this work tic since the terminological axioms describing a domain do not
we mainly focus on OWL-DL semantic descriptions. change frequently, while the instance store is very dynamic because
As application scenario, we have chosen the biomedical field. In it is constantly being updated with new assertions. An instance
this scenario, a huge amount of semantic data are continuously store is in fact a triple store consisting of RDF triples whose in-
being generated. Specifically, most of the semi-structured and stances refer to the ontology. We assume that the instance store
highly heterogeneous data sources (e.g. laboratory test reports, is consistent w.r.t. the associated ontology.
ultrasound scans, images, etc.) are being semantically annotated As previously mentioned, the main advantage of adopting OWL
through comprehensive domain ontologies like UMLS, NCI and is the inference capability. Thus, we are able to infer implicit
Galen. Fig. 1 shows an excerpt of a clinical report whose semantic knowledge derived from both the ontology axioms and the in-
annotation results in a set of DL axioms (Fig. 2) and assertions (Ta- stance store. The main inference tasks we require in our work
ble 1). The version of the DL used is ALCHOIQðDÞ. The axioms in are combinations of the following ones: concept subsumption,
Fig. 2 capture the semantics of all the information related to a pa- O  C v D, for any two concepts C and D from O, instance classifica-
tient (i.e. the medical history, reports, laboratory results, etc.) by tion, O [ IS  C(a) for any concept C 2 O and instance a 2 IS, and in-
conceptualizing the domain. Next section explains in detail the stance relationship, O [ IS  R(a, b) for any property R 2 O, and
OWL-DL constructors used. Table 1 shows the instance data, that instances a, b 2 IS. Additionally, we use the function MSC(i) to
is, the triples that describe some patient. The generated data also denote the most specific concept from O to which the instance i
presents complex relationships that evolve rapidly as new biomed- belongs (usually the asserted one). All these inference problems
ical research methods are applied. Clearly, traditional analytical have been demonstrated to be decidable in most of the DL families,
tools are not appropriate for this kind of data. and for some of them (e.g. OWL 2 profiles) the computational
complexity is polynomial.

3.1.1. Description logics 3.2. Association rules


Description logics allow ontology developers to define the do-
main of interest in terms of individuals, atomic concepts (called clas- The problem of discovering association rules was first intro-
ses in OWL) and roles (called properties in OWL). Concept duced in [2]. It can be formally described as follows. Let
constructors allow the definition of complex concepts composed I = {i1, i2, . . . , im} be a set of m literals, called items. Let
of atomic concepts and roles. OWL DL provides for union (t), inter- D = {t1, t2, . . . , tn} be a database of n transactions where each trans-
section (u) and complement ð:Þ, as well as enumerated classes action is a subset of I. An itemset is a subset of items. The support
of an itemset S, denoted by sup(X), is the percentage of transactions
in the database D that contain S. An itemset is called frequent if its
support is greater than or equal to a user specified threshold value.
An association rule r is a rule of the form X ) Y where both X
and Y are non-empty subsets of I and X \ Y = ;. X is the antecedent
of r and Y is called its consequent. The support and confidence of
the association rule r: X ) Y are denoted by sup(r) and conf(r).
These measures can be interpreted in terms of estimated probabil-
ities as sup(r) = P(X [ Y) and conf(r) = P(YjX) respectively. Another
interesting measure is the lift or interest factor of a rule, denoted
by lift(r), which computes the ratio between the rule probability
and the joint probability of X and Y assuming they are independent,
that is, P(X [ Y)/(P(X)  P(Y)). If this value is 1, then X and Y are
independent. The higher this value, the more likely that the exis-
tence of X and Y together in a transaction is not just a random
occurrence, but because of some relationship between them.
The basic task of association data mining is to find all associa-
tion rules with support and confidence greater than user specified
minimum support and minimum confidence threshold values [2].
Mining association rules basically consists of two phases: (1) com-
Fig. 1. Fragment of a clinical report of the Rheumatology domain. pute the frequent itemsets w.r.t. the minimum support, and (2)
54 V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62

Fig. 2. Ontology axioms (Tbox).

Table 1
Semantic annotations (Abox). Subject, predicate and object are URIs pointing to the respective resources. The fourth column shows the types of the related entities (we omit the
triple statements asserting the type of the resources for space issues).

Subject Predicate Object Related entities


PTNXZ1 hasAge ‘‘10’’ Patient and ‘‘xsd:Integer’’
PTNXZ1 sex Male Patient and Sex
VISIT1 date ‘‘06182008’’ Visit and ‘‘xsd:Date’’
VISIT1 hasReport RHEX1 Visit and Rheumatology
RHEX1 damageIndex ‘‘10’’ Rheumatology and ‘‘xsd:Float’’
RHEX1 results ULTRA1 Rheumatology and Ultrasonography
ULTRA1 hasAbnormality ‘‘Malformation’’ Ultrasonography and ‘‘xsd:String’’
ULTRA1 hasAbnormality Knee Ultrasonography and Body_Space_or_Junction
VISIT1 hasReport DIAG1 Visit and Diagnosis
DIAG1 hasDiagnosis Arthritis Diagnosis and Disease_or_Syndrome
VISIT1 hasReport TREAT1 Visit and Treatment
TREAT1 hasDrugTherapy DT1 Treatment and DrugTherapy
DT1 hasDrug Methotrexate DrugTherapy and Pharmacologic_Substance
PTNXZ1 hasVisit VISIT2 Patient and Visit
VISIT2 date ‘‘08202008’’ Visit and ‘‘xsd:Date’’
VISIT2 hasReport RHEX2 Visit and Rheumatology
RHEX2 damageIndex ‘‘15’’ Rheumatology and ‘‘xsd:Float’’
RHEX2 results ULTRA2 Rheumatology and Ultrasonography
ULTRA2 hasAbnormality ‘‘Malformation’’ Ultrasonography and ‘‘xsd:String’’
ULTRA2 hasAbnormality Knee Ultrasonography and Body_Space_or_Junction
RHEX2 results ULTRA3 Rheumatology and Ultrasonography
ULTRA3 hasAbnormality ‘‘Rotation 15’’ Ultrasonography and ‘‘xsd:String’’
ULTRA3 hasAbnormality Right_Wrist Ultrasonography and Body_Space_or_Junction
VISIT2 hasReport DIAG2 Visit and Diagnosis
DIAG2 hasDiagnosis Systemic_Arthritis Diagnosis and Disease_or_Syndrome
VISIT2 hasReport TREAT2 Visit and Treatment
TREAT2 hasDrugTherapy DT2 Treatment and DrugTherapy
DT2 hasDrug Methotrexate DrugTherapy and Pharmacologic_Substance
TREAT2 hasDrugTherapy DT3 Treatment and DrugTherapy
DT3 hasDrug Corticosteroids DrugTherapy and Pharmacologic_Substance
   

generate rules from the frequent itemsets w.r.t. the minimum cardinalities of X and Y (e.g. horn-like rules), and the kind of
confidence. As the number of frequent sets can be very large, itemsets that can participate in the rules (e.g. generalized rules
closed and maximal itemsets can be generated instead. There [32], quantitative association rules [21,25]), as well as time-
are also several types of association rules depending on the dependent rules [31].
V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62 55

Fig. 3. Arquitecture of our proposal for mining semantic annotations.

Fig. 5. Example of extended SPARQL query with CREATE MINING MODEL


statement.

3.3. Instance store association rules

Both the ontology and the instance store can be encoded into tri- In other words, the items co-occurring together under the same
ples of the form ‘‘subject-predicate-object’’, which basically form a context instance are grouped into a transaction. In this new sce-
graph where nodes are resources and literals, and edges are prop- nario the set of items I = {i1, i2, . . . , im} is not a set of literal values de-
erties connecting them. This dynamic and graph-based structure fined a priori (as opposed to the traditional scenario). Instead,
contrasts with the well-structured and homogeneous datasets that items are complex literals dependent on the context specified by
feed traditional association rules algorithms. Therefore, in order to the user. Therefore, they are dynamically derived from the user
obtain items and transactions from the IS, users must state the tar- DM pattern. It is worth mentioning that the generated transactions
get concept of analysis and their involved features. The features are binary.
must have some kind of relation with the target concept, that is,
they will be extracted from the subgraph around each instance Definition 4 (Data mining problem). Given a set of transactions
belonging to the target concept of analysis. The following defini- generated from an instance store IS consistent with the ontology O
tions introduce the basic concepts of association rules mining in and a mining pattern Q, a data mining problem consists in finding
the context of SW data. association rules with minimum support and confidence threshold
values. Support and confidence measures are calculated in the
Definition 1. A property chain (r1    rn) is an aggregation path traditional way by taking into account the frequent itemsets.
from concept C to concept C0 iff O C v $r1    rnC0 . The set of all
aggregation paths from C to C0 is denoted Path(C, C0 ). Aggregation The previous DM problem can be thought as a classic DM prob-
paths can be of length one. lem where the input data must be derived from the ontology in or-
Property chains will allow us to face the structural heterogene- der to generate transactions according to the user specification.
ity present in the instance store.
4. Methodology
Definition 2 (Ontology data mining pattern). An ontology mining
pattern Q is a triple (CT, Cctxt, Features), where CT is the target In this section we present a detailed view of our method along
concept to which all the concepts in Q must be related in the with the definitions that sustain it. Fig. 3 depicts a schematic over-
ontology O through property chains, Cctxt is the context concept view of the whole process. The user specifies a mining pattern fol-
from which the transactions are built, and Features is a set of lowing an extended SPARQL syntax. Then, the transaction extractor
concepts from which transaction items are derived from. We will is able to identify and construct transactions according to the min-
use the functions target(Q), context(Q), and features(Q) to refer to ing pattern previously specified. Finally, the set of transactions ob-
Q’s components respectively. tained are processed by a traditional pattern mining algorithm,
In the definition of a mining pattern, we can indicate that an which finds association rules of the form specified in the mining
item can be derived from a concept through a data type property pattern with support and confidence greater than user’s specified
by using the DL expression C u $DT, where DT is the data type ones.
property used for generating the item.
4.1. Mining pattern specification
Definition 3. Transaction generation problem. Given an instance
store IS consistent with the ontology O, and a mining pattern Q, The user has to specify the kind of patterns (s) he is interested in
find co-ocurrences between items derived from instances belong- obtaining from the repository. Since semantic annotations are en-
ing to features(Q), which must happen within instances belonging coded in RDF/(S) and OWL, we have extended SPARQL with a new
to context(Q) associated to instances of target(Q). statement that allows to specify a mining pattern. The syntax is in-

Fig. 4. Extended SPARQL grammar for the CREATE MINING MODEL statement.
56 V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62

Fig. 6. Example of ontology graph fragment.

Fig. 7. Example of instance store fragment consistent with ontology fragment of Fig. 6.

Table 2
Example of composition triples.

Subject Composition path Object Feature


PTN_XY21 (VISIT1, RHEX1, ULTRA1) Malformation Disease
PTN_XY21 (VISIT1, RHEX1) RHEX1 Report u $ damageIndex
PTN_XY21 (VISIT1, TREAT1, DT1) Methotrexate Drug
PTN_XY21 (VISIT2, RHEX2, ULTRA2) Malformation Disease
PTN_XY21 (VISIT2, RHEX2) RHEX2 Report u $ damageIndex
PTN_XY21 (VISIT2, RHEX2, ULTRA3) Bad rotation Disease
PTN_XY21 (VISIT2, TREAT2, DT2) Methotrexate Drug
PTN_XY21 (VISIT2, TREAT2, DT3) Corticosteroids Drug
   

spired by the Microsoft Data Mining Extension (DMX), which is an The user has selected Patient as target concept of analysis. The set of
SQL extension to work with DM models in Microsoft SQL Server features that will populate the transactions include diagnosed
Analysis Services.4 diseases, prescribed drugs and the damage index5 measure of the
The extended SPARQL grammar is depicted in Figs. 4 and 5 patient. Finally, the transaction will be built at the granularity of
shows the SPARQL specification for the following mining pattern: report, that is, transactions will not include features across reports
but only features within each report. The previous mining pattern
Q ¼ ðPatient; Report; fDisease; Drug; Report u 9 damageIndexgÞ

5
the damage index is a numerical score that measures the damage of the
4
DMX Reference: http://technet.microsoft.com/en-us/library/ms132058.aspx. articulations of rheumatic patients.
V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62 57

Table 3 Average confidence of the rules


Item transactions generated for the running example.

1.00
xxxxxxxxxxxxxxxxx xxx *************************** ++++++++++++
# Transaction xx x ●
***
xxxx * +++
xx * +●
1 {RheuExam.Disease.Malformation, RheuExam.RheuExam. damageIndex ? 10} xx x **
** * +++
2 {Treatment.Drug.Methotrexate} x
x * +++
xx *
3 {RheuExam.Disease.Malformation, RheuExam.Disease.BadRotation, xx * +

0.95
RheuExam.RheuExam.damageIndex ? 15} xx ++
●x *
4 {Treatment.Drug.Methotrexate, Treatment.Drug.Corticosteroids}
xx ** ++
xx
xx * ++

Avg. confidence
x xx * +
*
****

0.90
Table 4
Generated transactions using different contexts for the HeC instance store.

Context # Trans. Avg. length


Patient 588 29.57

0.85
Visit 1458 12.84
Report 3608 5.24
x Patient
* Visit
+ Report

0.80
Coverage of the transactions
0.30 0.25 0.20 0.15 0.10 0.05 0.00
support
1.0

x Patient
x xx Fig. 9. Average confidence of the generated rules with different minimum support
* Visit xxx x xxxx
+ Report thresholds.
0.8

xx
x
xxx xxxx
x xxxx x Coverage*Avg. confidence of the transactions
xx
1.0

*
0.6

x **
*
coverage

*** x Patient
* Visit
*** + Report
**
*
0.8
0.4

x xxxxxxxx xxxxxxxxxx *
*

****
coverage*avg_conf

XXXXXXX
0.6
0.2

** X XX
XXX
************************ X * **
X ** * *
X *
0.4

X
0.0

* *
XXXXXXXXXXXXXXXXXXX X +++
XX ++++
0.30 0.25 0.20 0.15 0.10 0.05 0.00 X **** +
X * +++++ +
support
+
0.2

X
Fig. 8. Coverage of the transactions achieved by the generated rules with different X *+
**
X *+ +
+
minimum support thresholds. X ++ +
X + *
X
X +
0.0

************************ X++
++
++
++
++ * +
can be specified in a SPARQL-fashion as follows. We extend the
0.30 0.25 0.20 0.15 0.10 0.05 0.00
SPARQL grammar rule Query by adding a new symbol, named Min- support
ingQuery. This symbol expands to the keywords CREATE MINING
MODEL followed by the Source, which identifies the input repository. Fig. 10. Coverage multiplied by the average confidence of the generated rules with
different minimum support thresholds.
The body consists of the variables the user is interested in with the
purpose of data mining. Next to each variable, we specify its content
type: RESOURCE for variables holding RDF resources and LITERAL for to. In case variables are not resources, users must specify the domain
variables holding literal values. In case we want to find patterns with and data type property from which that literal is reachable. In the
just one occurrence of the variable, we attach the keyword MAX- example, the variable jadi refers to the damage index score of a pa-
CARD1 to the variable. By default, patterns found can contain more tient joint, which is a literal, so the user specifies report and damage-
than one occurrence of each variable. Moreover, we specify the con- Index as the resource and data type property from which the value
sequent of the rule by attaching the keyword PREDICT (notice that can be accessed, respectively. The UsingClause grammar symbol de-
this is optional). Finally, the keyword TARGET denotes the resource fines the name and parameters of the learning algorithm. Finally,
under analysis, which must be an ontology concept. The analysis tar- the MeasuresClause is used to specify other interestingness measures
get determines the set of obtained rules. In the WhereClause, we that the mining model should provide.
specify the restrictions over the previous variables. The good news Since we do not ask the user to specify the exact relations, the
is that we do not expect users to have an exact knowledge of the previous query model introduces some ambiguity regarding the
ontology structure. Therefore, users do not have to input the paths items that form a transaction. That is to say, the same conceptual
relating the pattern variables in SPARQL. Instead, they are asked to entities (i.e. selected features) may appear under different contexts
specify just the type (i.e. ontology concept) that the variables refer in the ontology. For example, Disease may refer to the patient’s
58 V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62

Table 5 The set of aggregation paths between i and i0 is denoted Path(i, i0 ).


Quality measures of the generated rules for the different transaction sets.
The first condition of the definition states that there must be
Context Min. sup. # Rules Avg. conf. Avg. lift /-coeff Cov. a property chain between both instances in the IS, and the
Patient (588) 0.187 109 0.993 2.944 0.796 0.678 second one states that such a path must be also derived from
Visit (1458) 0.047 93 0.976 9.975 0.865 0.480 the Tbox.
Report (3608) 0.017 151 0.964 27.69 0.836 0.169
Definition 8. Let iT 2 IT(Q) be a target instance, and ia, ib 2 IS two
named instances. Contexts(ia, ib, iT) represent the set of common
Table 6 reachable instances, which is defined as follows:
Quality measures of the generated rules at the context of patient with minimum
support set to 0.1 by selecting only transactions containing certain types of reports.
1. i0 2 Contexts(ia, ib, iT) if $p1 2 Paths(ia, i0 ) ^ $p2 2 Paths(ib, i0 ) ^
Transactions # rules Avg. conf. Avg. lift % Multi-context
0 0
$p3 2 Paths(i , iT) (i is common reachable instance), and
All reports (588) 655 0.943 4.125 83
2. 9= i00 2 Contexts(ia, ib, iT) such that $pz 2 Paths(i0 , i00 ) (is least).
Top-5 excl. (585) 22 0.963 6.966 56
Top-12 excl. (438) 26 0.934 6.652 35

Example 2. Fig. 7 shows an example of IS represented as a graph


that is consistent with ontology fragment in Fig. 6. In this example,
diagnosed disease or to some family member disease. This ambigu- Contexts(DT1, DIAG1, PTN_XY21) = {VISIT1} because p1 = DT1.has-
ity makes it challenging for the system to automatically discover TherapyhasReport.VISIT1, p2 = DIAG1.hasReport.VISIT1 and
what the users’ intentions really are. As previously mentioned, p3 = VISIT1.hasVisit.PTN_XY21.
the user could remove this ambiguity by specifying in the SPARQL
extended query the exact relation of concepts in the ontology
Definition 9. An instance transaction associated to a query pattern
through pattern graph triples in the WHERE clause. However, this
Q and a target instance iT 2 IT(Q), is the set of instances Itrans # IS
task can be cumbersome and not always viable. In order to provide
such that:
the right sense to the query, the user can select the intended con-
text with the CONTEXT keyword attached to the appropriate con-
 "i 2 Itrans, $C0 2 features(Q), O [ IS  C(i) ^ C v C0 , and
cept. Alternatively, the system will build the transactions by
 $iC 2 IS, $p 2 Paths(i, iC), $p0 2 Paths(iC, iT), and O [ IS  MSC(iC) v
taking into account all the possible contexts.
context(Q).

4.2. Transaction extractor foundations


Notice that the number of transactions corresponds to the num-
ber of connected pairs (iT, iC).
Since our goal is to identify and construct transactions accord-
ing to the user’s mining pattern, in this section we present all
Example 3. Given the mining pattern (Patient, Visit, {Dis-
the definitions that sustain the method we have developed.
ease, Drug}), the set of valid instance transactions are {Malforma-
Let O be an ontology, Q a data mining pattern, and IS an instance
tion, Methotrexate, Arthritis} and {Malformation, Bad rotation,
store consistent w.r.t. O. The following definitions state the neces-
Systemic Arthritis, Methotrexate, Corticosteroids}.
sary elements to define the intended transactions over the IS.

Definition 5. The target instances for Q consist of the set IT(Q) = {i/
i 2 IS, O [ IS  CT(i) ^ CT v target(Q)}. 4.3. Transactions and items generation
In the running example, CT can be any subconcept of Patient and
IT(Q) is the set of all instances classified as Patient. Recall that a transaction is composed by a set of items co-
occurring within instances belonging to context(Q). The items are
Definition 6. We define the contexts of two concepts Ca and Cb in turn dynamically generated from instances belonging to
under CT, denoted with Contexts (Ca, Cb, CT), as the set of concepts C0 features(Q) and the user specified context. This section illustrates
satisfying the following conditions: how instance transactions are first generated and how the final
item transactions are later derived.
The generation of instance transactions from the instance store
(1) C0 2 Contexts(Ca, Cb, CT) if $ p1 2 Paths(Ca, C0 ) ^ $p2 2 Paths
is performed by first computing the composition triples involved
(Cb, C0 ) ^ $p3 2 Paths(C0 , CT), CT v target(Q) (C0 is common
by the mining pattern Q. Basically, a composition triple connects
reachable concept), and
each instance iT 2 IT(Q) to instances i0 2 features(Q) that are con-
= E 2 Contexts(Ca, Cb, CT) such that $pz 2 Paths(E, C0 ) (is least).
(2) 9
nected through a path p of intermediate instances. The path p must
contain just one instance iC belonging to some concept of
In this definition the second condition ‘‘is least’’ means that the
context(Q).
concept C0 is the nearest common reachable concept to Ca and Cb.
Composition triples are easily computed by depth-first tra-
versal, starting from the instances retrieved for target(Q) and stop-
Example 1. In Fig. 6, Contexts(DrugTherapy, Diagnosis) = {Visit}
ping when an instance of some concept of features(Q) is found.
because p1 = DrugTherapy.hasTherapy hasReport.Visit, p2 = Diag-
Then the resulting path is validated.
nosis.hasReport.Visit and p3 = Visit.hasVisit.Patient.
To generate the final transactions, i.e. those that will feed the
association mining algorithm, composition triples must be
Definition 7. Let i and i0 be two named instances of the instance grouped. More specifically, those composition triples with the
store IS. (r1    rn) is an aggregation path from i to i0 iff: same subject and the same path sliced at the position of the con-
text instance iC, are grouped together. The transaction will be gen-
1. There exists a list of property assertions rj(ij1, ij) 2 IS, 1 6 j 6 n. erated from the set of objects belonging to each group. Only
2. There exists two concepts C, C0 2 O such that O [ IS  C(i) ^ C0 (i0 ) instance transactions satifying Definition 9 are considered for gen-
and (r1    rn) 2 Paths(C, C0 ) erating the final item transactions.
V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62 59

The last step is to generate the item transactions that will serve 5.2. Experimental set-up
as input to the association mining algorithm. This step is described
in Algorithm 1. This algorithm first expands those instances having The generation of the transactions is guided by the user require-
some data type property in the feature’s set with the correspond- ments specified in the form of a mining pattern. In order to avoid
ing values asserted in the IS. Since the items are dependent on manual intervention in our experiments, we have automatically
the context, they include it in their definition. generated query patterns for 12 different feature concepts (e.g.
Disease, Procedure,Drug, and so on), and 3 concepts for contexts:
Algorithm 1. Generation of item transactions from instance Patient, Visit and Report. The target concept is always Patient. It is
transactions worth mentioning that the concept Report has 20 subconcepts pair-
wise disjoint, which correspond to different clinical reports of the
Require: An instance transaction T, the query pattern Q, the
HeC project.
ontology O, the instance store IS, and the composition
The current implementation of the transaction extractor has
triples ISQ.
been developed on top of the ontology indexing system proposed
Ensure: An item transaction T0 .
in [29], which also provides a simple reasoning mechanism over
T0 = ;
the ontology indexes. To prove the usefulness and quality of the
iC be the instance belonging to context(Q) within the
generated transactions we have a wide range of association rules
composition path associated to T in ISQ.
algorithms available. Recently, genetic algorithms (GAs) have been
for all i 2 T do
proposed for mining interesting association rules [4,3]. However,
if O [ IS  MSC(i) v Ci ^ Ci 2 features(Q)
since the focus of the work is not on the association rules algo-
^ MSC(i) v (Ci u $DTP) then
rithms, we evaluate the generated transactions with more tradi-
for all (i, DTP, value) 2 IS do
tional approaches.
update T0 with the item MSC(iC).MSC(i).DTP ? value
In order to obtain the association rules we make use of several
end for
utilities of the package arules of the R toolkit [16]. First, we calcu-
else
late the closed frequent itemsets [40], which reduces the total num-
update T0 with the item MSC(iC).MSC(i).i
ber of itemsets specially when transactions have sparse features.
end if
Then, we prune itemsets with cross support [37] lower than 0.7
end for
in order to filter out patterns that mix frequent and rare items,
return T0
since they tend to be spurious. Finally we apply the ruleInduction
function to derive rules from the previous itemsets with minimum
confidence set to 0.8.
Table 2 shows the composition triples for the mining pattern:
Q ¼ ðPatient; Report; fDisease; Drug; Report u 9damageIndexgÞ
5.3. Results
and the IS fragment shown in Fig. 7. Underlined instances belong to
concepts of context(Q). The item transactions derived from these The proposed method is able to generate transactions at differ-
composition triples are shown in Table 3. ent contexts specified by the user. Table 4 shows the three selected
contexts for the experiments along with the number of generated
transactions and their average length. The number and nature of
5. Experimental results the transactions obtained at every context is quite different and
will therefore influence the generated rules. We observe general
Several experiments have been carried out in this paper to contexts tend to generate less and longer transactions, which in
prove that our method is able to generate semantic-based transac- turn increases the probability of obtaining more rules. On the con-
tions from SW data, which later translates into high quality associ- trary, specific contexts generate smaller transactions, which hin-
ation rules. Notice the focus of the work is not on developing new ders the discovery of rules. This disparity in the nature of the
mining algorithms but on transforming and making complex transactions makes it necessary to adequately tune the minimum
semantic data amenable to existing ones. In the future, we will support threshold of each data set to be able to find interesting
investigate on new mining algorithms that profit from the seman- association rules.
tically-enriched items of the generated transactions. Fig. 8 analyzes the coverage of the generated rules with differ-
ent support thresholds as an indicator of their quality. The rules
obtained with the Patient transaction set achieve good coverage
5.1. Dataset rates with relatively high support thresholds. However, the other
obtained rule sets are not able to explain a high percentage of
In order to show the usefulness of our proposal, we test the the transactions. The intuition behind this is that the length of
method over a real-world instance store holding OWL annotations the latter transaction sets is shorter, which usually tends to de-
about patient’s follow-ups. These annotations have been generated crease the number of association rules discovered. In the case of
in the context of the Healt-e-Child project,6 and they are consistent the Report transaction set, the coverage is even lesser because
with an ontology similar to the one used as example in Fig. 2. The the transactions stem from different types of reports. Therefore,
structure of the semantic annotations is very heterogeneous and interesting rules might occur but with very low support thresholds.
holds information about 588 patients classified in three different In these cases it would be interesting to apply more sophisticated
groups according to their disease: Juvenile Ideopathic Arthritis mining algorithms that do not rely on the notion of support, as
(JIAPatient), heart disease (CardioPatient) and neurological disease those relying on GAs [4].
(NeuroPatient). The total number of semantic annotations is Fig. 9 shows the average confidence of the generated rules with
629.000, which gives more than 1000 semantic annotations per different support threshold values. We observe the confidence of
patient on average. the rules for the three transaction sets remains high even for low
support thresholds, which reasserts the quality of the obtained
6
http://www.healt-e-child.org/. rules.
60 V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62

Color Key Color Key


and Histogram and Histogram
20 40 60

80
Count

Count
40
0

0
0 0.2 0.4 0.6 0 0.1 0.3 0.5
Value Value

Disease Molecule

Finding Activities

Result Disease

Anatomy Result

Activities Attribute

Drug Finding

Attribute Drug

Symptom Anatomy

Molecule Cell

Immunolog Immunolog

Cell Symptom

Procedures Procedures
Drug

Drug
Procedures

Cell

Molecule

Symptom

Activities

Result

Finding

Disease

Procedures

Symptom

Cell

Finding

Result

Disease

Activities

Molecule
Attribute

Attribute
mmunology

mmunology
Anatomy

Anatomy
Color Key Color Key
and Histogram and Histogram
100

80
Count

Count
40
40
0

0 0.4 0.8 1.2 0 0.2 0.6 1


Value Value

Procedures Disease

Activities Finding

Finding Result

Molecule Activities

Immunolog Attribute

Disease Symptom

Attribute Drug

Anatomy Immunolog

Drug Anatomy

Result Molecule

Cell Cell

Symptom Procedures
Drug

Drug
Symptom

Cell

Result

Disease

Molecule

Finding

Activities

Procedures

Procedures

Cell

Molecule

Symptom

Activities

Result

Finding

Disease
Attribute

Attribute
mmunology

mmunology
Anatomy

Anatomy

Fig. 11. Heat maps showing the distribution of rules w.r.t. mining pattern features for different context concepts of type Report.

Based on the two previous measures (i.e. coverage and average the average /-coefficient shows a strong positive correlation in
confidence), we want to select a minimum support threshold for all the cases.
each transaction set and further analyze the quality of the obtained Table 6 shows the impact of applying certain restrictions (i.e.
rules. Fig. 10 shows the coverage multiplied by the average confi- selecting only specific report types) in the mining pattern. Each
dence (re-scaled). For each transaction set, we select the support row shows the resulting transactions and rules regarding all re-
threshold where both measures are maximized. ports, discarding the 5 more frequent reports and finally discarding
Table 5 shows the number of generated rules along with the 12 more frequent reports at the context of Patient. This exper-
their average confidence, the average lift and the /-coefficient iment shows how different patterns can be obtained by selecting
[35] for the three transaction sets. Notice that all generated the appropriate information through the mining pattern. In this ta-
rules have a high average confidence. Also, the more restricted ble, we have also included the percentage of rules that have items
the context is the stronger are the rules generated. Moreover, stemming from different reports. These rules are of special interest
V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62 61

Table 7
Examples of rules with patient as context.

Rule Supp. Conf. Lift


{JIAPatient. (Disease.disease)->oligoarthritis, 0.26 1.0 18.833
JIAPatient. (Finding.sacroiliac_tenderness & Anatomy.sacroiliac_joint & MedicalProcedure.presence)} =>
{JIAPatient. (Symptom.inflammatory_pain)}}
{CardioPatient. (MedicalProcedure.genetic_research & ExternalActivities.genetic_molecular), 0.107 1.0 9.187
CardioPatient. (Molecular.tbx5)} =>
{CardioPatient. (Molecular.5_nkx2)}
{NeuroPatient. (ExternalActivities.endocrinology)} => 0.107 0.969 8.380
{NeuroPatient. (MedicalProcedure.resonance_magnetic_material_imaging_contrast &Chemical.metal)}
{JIAPatient. (Symptom.tenderness & Quantity.score)->1} => 0.144 1.0 6.917
{JIAPatient. (Finding.pain)->1}
{CardioPatient. (Finding.ejection_systolic_murmur), 0.108 0.914 5.485
CardioPatient. (Symptom.cyanosis)} =>
{CardioPatient. (Anatomy.border_sternal)}

Table 8
Examples of rules with visit as context.

Rule Supp. Conf. Lift


{CardioVisit. (MedicalProcedure.auscultation_lung), 0.074 1.0 11.950
CardioVisit. (Quantity.compensation)} =>
{CardioVisit. (Finding.breath_sounds & Quality.equality)}
{JIAVisit. (Chemical.methotrexate)->Weekly} => 0.08 0.990 7.197
{JIAVisit. (Chemical.methotrexate)}
{JIAVisit. (Finding.active_wrist & Disease.arthritis_wrist), 0.154 1.0 6.451
JIAVisit. (PatientGroup.parents_legal & MedicalProcedure.data_collection & Result.consent),
JIAVisit. (Quality.inclusion_criteria)} =>
{JIAVisit. (Disease.idiopathic_juvenile_arthritis & ExternalActivities.rheumatology

Table 9
Examples of rules with Rheumatology reports as context.

Rule Supp. Conf. Lift


{JIADiagnosis. (Finding.fever)} => 0.010 0.812 57.439
{JIADiagnosis. (Finding.erythematous_rash)}
{Surgery. (Finding.surgery_performed), 0.012 0.815 46.137
Surgery. (MedicalProcedure.resonance_magnetic_material_imaging_contrast & Chemical.metal)} =>
{Surgery. (MedicalProcedure.tumour_removal)}
{DiagnosisCollectingForm. (ExternalActivities. endocrinology), 0.014 0.808 40.34
DiagnosisCollectingForm. (Finding. neurological_findings),
DiagnosisCollectingForm. (MedicalProcedure. resonance_magnetic_material_imaging_contrast & Chemical.metal)} =>
{DiagnosisCollectingForm. (MedicalProcedure. tomography_computed)->Abnormal}
{JIALaboratoryOPBG. (Immunology.antibody_antinuclear)} => 0.0133 0.85 37.951
{JIALaboratoryOPBG. (Immunology.rheumatoid_factor)}
{CMPPatientGeneralInfo. (Finding.hospital_birth), 0.010 0.838 14.926
CMPPatientGeneralInfo. (MedicalProcedure. procedures_therapeutic), procedures_therapeutic),
CMPPatientGeneralInfo. (PatientAttribute. pregnancy_term_full),
CMPPatientGeneralInfo. (Symptom.cyanosis)} =>
{CMPPatientGeneralInfo. (Quality.provisional_diagnosis)-> postop_ToF}

for clinicians as they can find associations between clinical relating procedures with activities. Finally, in the fourth heatmap
variables collected from different departments. Notice that the most of the rules relate procedures with diseases and findings.
likelihood of finding multi-report associations decreases with the These maps can also be used to explore and select the rules of
frequency of the reports. interest.
In order to show the variety of patterns that can be found by Finally, Tables 7–9 show some interesting rules obtained for the
selecting different contexts, we have summarized the obtained Patient, Visit and Report contexts, respectively. Notice that most of
rules in terms of the features they contain. Summaries are pre- these rules are self-explained and we consider most of them of
sented in form of heatmaps, where each cell represents a pair of interest for the clinicians.
features, and its color indicates the percentage of rules containing
that pair. Fig. 11 shows the heatmaps generated for different con-
text concepts of type Report. It can be noticed that by considering 6. Conclusions
different contexts we obtain rules of very different nature. For
example, the first heatmap shows a high percentage of rules relat- We have presented a novel method for mining association rules
ing procedures with diseases and findings. The second heatmap from heterogeneous semantic data repositories expressed in RDF/
shows a high percentage of rules relating procedures with (S) and OWL. To the best of our knowledge, this problem has only
molecules. In the third heatmap the prevailing rules are the ones been considered to a minor extent. The intuition under the method
62 V. Nebot, R. Berlanga / Knowledge-Based Systems 25 (2012) 51–62

developed is to extract and combine just the interesting instances [13] B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using
frequent itemsets, in: D. Barbará, C. Kamath (Eds.), Proceedings of the Third
(i.e. features) from the whole repository and flatten them into tra-
SIAM International Conference on Data Mining, SIAM, 2003, pp. 59–70.
ditional transactions while capturing the implicit schema-level [14] A. García, R. Berlanga, R. Dánger, A description clustering data mining
knowledge encoded in the ontology. Then, existing association technique for heterogeneous data, Commun. Comput. Inform. Sci., Softw.
rules algorithms can be applied. We believe this type of learning Data Technol. 10 (2008) 361–373.
[15] P. Giannikopoulos, I. Varlamis, M. Eirinaki, Mining frequent generalized
will become increasingly important in future research both from patterns for web personalization in the presence of taxonomies, IJDWM 6
the machine learning as well as from the SW communities. Initial (1) (2010) 58–76.
experiments on real world SW data enjoy promising results and [16] M. Hahsler, B. Gruen, K. Hornik, Arules – a computational environment for
mining association rules and frequent item sets, J. Stat. Softw. 14 (15) (2005)
show the usefulness of our approach. As future work, we would 1–25.
like to apply generalized query patterns by using the ontology axi- [17] J. Hartmann, Y. Sure. A knowledge discovery workbench for the semantic web,
oms, as well as to automatically discover interesting contexts and in: Workshop on Mining for and from the Semantic Web at the ACM SIGKDD,
August 2004.
their association rules. Moreover, our method could be applied in a [18] C. Kiefer, A. Bernstein, A. Locher, Adding data mining support to SPARQL via
variety of different scenarios such as [34,41,39], where the mining statistical relational learning methods, in: S. Bechhofer, M. Hauswirth, J.
tasks are transaction oriented. An interesting issue for future work Hoffmann, M. Koubarakis (Eds.), ESWC, Lecture Notes in Computer Science,
vol. 5021, Springer, 2008, pp. 478–492.
is to use the knowledge encoded in the ontology in order to filter [19] K. Kochut, M. Janik, SPARQLeR: extended SPARQL for semantic association
and prune discovered rules, and also to express the user goals, sim- discovery, in: ESWC, LNCS, vol. 4519, Springer, 2007, pp. 145–159.
ilar to the work in [24,7]. Another important direction worth [20] M. Kuramochi, G. Karypis, Frequent subgraph discovery, in: N. Cercone, T.Y.
Lin, X. Wu (Eds.), ICDM, IEEE Computer Society (2001) 313–320.
exploring concerns the combination of clustering and association
[21] B. Lent, A. Swami, J. Widom, Clustering association rules, Proc. ICDE’97 (1997)
mining algorithms to summarize document collections. This tech- 220–231.
nique was formerly introduced in [13] through the Frequent Item- [22] Y. Li, C. Yu, H.V. Jagadish, Schema-free XQuery, in: VLDB ’04: Proceedings of the
set based Hierarchical Clustering (FIHC). Basically, the FICH 30th International Conference on Very Large Data Bases, VLDB Endowment,
2004, pp. 72–83.
algorithm generates clusters from frequent itemsets, which in turn [23] F.A. Lisi, F. Esposito, Mining the semantic web: a logic-based methodology, in:
constitute the cluster descriptors. Several enhancements of this ISMIS, LNCS, vol. 3488, Springer, 2005, pp. 102–111.
algorithm have been proposed since then (e.g. [41]). Recently, [5] [24] C. Marinica, F. Guillet, Knowledge-based interactive postmining of association
rules using ontologies, IEEE Trans. Knowledge Data Eng. 22 (6) (2010) 784–
proposed a novel approach also based on frequent item pairs that 797.
provides more homogeneous clusters and better descriptions than [25] R.J. Miller, Y. Yang, Association rules over interval data, SIGMOD Rec. 26 (2)
those obtained with FIHC. Alternative research lines, which are out (1997) 452–461.
[26] S. Muggleton, L.D. Raedt, Inductive logic programming: theory and methods, J.
of the scope of the present work, consist in applying more sophis- Log. Program. 19 (20) (1994) 629–679.
ticated data mining algorithms to the generated transactions [4,3] [27] T. Näppilä, K. Järvelin, T. Niemi, A tool for data cube construction from
and study their performance. Equally interesting is to devise new structurally heterogeneous xml documents, JASIST 59 (3) (2008) 435–449.
[28] V. Nebot, R. Berlanga, Mining association rules from semantic web data, Proc.
data mining algorithms that take profit from the semantically- IEA/AIE (2010) 0–10.
enriched items of the generated transactions. [29] V. Nebot, R.B. Llavori, Efficient retrieval of ontology fragments using an
interval labeling scheme, Inf. Sci. 179 (24) (2009) 4151–4173.
[30] T. Niemi, T. Näppilä, K. Järvelin, A relational data harmonization approach to
References
xml, J. Inf. Sci. 35 (5) (2009) 571–601.
[31] S. Ramaswamy, S. Mahajan, A. Silberschatz. On the discovery of interesting
[1] Workshop on Mining for and from the Semantic Web (MSW-04), 2004. patterns in association rules, in: Proceedings of the 24th International Conference
[2] R. Agrawal, T. Imielinski, A.N. Swami, Mining association rules between sets of on Very Large Data Bases, VLDB ’98, San Francisco, CA, USA, 1998, pp. 368–379.
items in large databases, SIGMOD Conference, ACM Press, 1993. pp. 207–216. [32] R. Srikant, R. Agrawal, Mining generalized association rules, VLDB (1995) 407–
[3] B. Alatas, E. Akin, An efficient genetic algorithm for automated mining of both 419.
positive and negative quantitative association rules, Soft Comput. 10 (3) (2006) [33] G. Stumme, A. Hotho, B. Berendt, Semantic web mining: state of the art and
230–237. future directions, Web Semantics: Sci. Services Agents World Wide Web 4 (2)
[4] J. Alcala-Fdez, N. Flugy-Pape, A. Bonarini, F. Herrera, Analysis of the (2006) 124–143.
effectiveness of the genetic algorithms based on extraction of association [34] A. Tagarelli, S. Greco, Semantic clustering of xml documents, ACM Trans. Inf.
rules, Fundam. Inf. 98 (2010) 1–14. Syst. 28 (1) (2010).
[5] H. Anaya-Sánchez, A. Pons-Porrata, R. Berlanga, A document clustering [35] P.-N. Tan, V. Kumar, J. Srivastava, Selecting the right objective measure for
algorithm for discovering and describing topics, Pattern Recognit. Lett. 31 (6) association analysis, Inf. Syst. 29 (2004) 293–313.
(2010) 502–510. [36] M.-C. Tseng, W.-Y. Lin, R. Jeng, Updating generalized association rules with
[6] F. Baader, D. Calvanese, D.L. McGuinness, D. Nardi, P.F. Patel-Schneider (Eds.), evolving taxonomies, Appl. Intell. 29 (3) (2008) 306–320.
The Description Logic Handbook: Theory, Implementation, and Applications, [37] H. Xiong, P.-N. Tan, V. Kumar, Mining strong affinity association patterns in
Cambridge University Press, 2003. data sets with skewed support distribution, Proceedings of the Third IEEE
[7] K. Becker, M. Vanzin, O3R: Ontology-based mechanism for a human-centered International Conference on Data Mining, ICDM ’03, IEEE Computer Society,
environment targeted at the analysis of navigation patterns, Know.-Based Syst. Washington, DC, USA, 2003. p. 387.
23 (2010) 455–470. [38] Y. Xu, Y. Papakonstantinou, Efficient keyword search for smallest LCAs in XML
[8] S. Bloehdorn, Y. Sure, Kernel methods for mining instance data in ontologies, databases, SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD International
in: ISWC/ASWC, LNCS, vol. 4825, Springer, 2007, pp. 58–71. Conference on Management of Data, ACM, New York, NY, USA, 2005. pp. 527–
[9] P. Buitelaar, P. Cimiano, B. Magnini (Eds.), Ontology Learning from Text: 538.
Methods, Evaluation and Applications, Frontiers in Artificial Intelligence and [39] U. Yun, K.H. Ryu, Approximate weighted frequent pattern mining with/
Applications, vol. 123, IOS Press, Amsterdam, 2005. without noisy environments, Know.-Based Syst. 24 (2011) 73–82.
[10] Y. Chi, R.R. Muntz, S. Nijssen, J.N. Kok, Frequent subtree mining – an overview, [40] M.J. Zaki, Mining non-redundant association rules, Data Min. Knowl. Discov. 9
Fundam. Inform. 66 (1–2) (2005) 161–198. (2004) 223–248.
[11] R. Dánger, J. Ruiz-Shulcloper, R.B. Llavori, Objectminer: a new approach for [41] W. Zhang, T. Yoshida, X. Tang, Q. Wang, Text clustering using frequent
mining complex objects, ICEIS 2 (2004) 42–47. itemsets, Knowl.-Based Syst. 23 (5) (2010) 379–388.
[12] N. Fanizzi, C. d’Amato, F. Esposito, Metric-based stochastic conceptual
clustering for ontologies, Inf. Syst. 34 (8) (2009) 792–806.

View publication stats

You might also like