Technical Report. November 2017.
arXiv:1804.05639v1 [cs.DB] 16 Apr 2018
NELL2RDF: Reading the Web,
and Publishing it as Linked Data
José M. Giménez-Garcı́a1, Maı́sa Duarte1 , Antoine Zimmermann2
Christophe Gravier1, Estevam R. Hruschka Jr.3,4 , and Pierre Maret1
1
2
Univ Lyon, UJM-Saint-Étienne, CNRS, Laboratoire Hubert Curien
UMR 5516, F-42023 Saint Étienne, France
{jose.gimenez.garcia, maisa.duarte, christophe.gravier,
pierre.maret}@univ-st-etienne.fr
Univ Lyon, MINES Saint-Étienne, CNRS, Laboratoire Hubert Curien
UMR 5516, F-42023 Saint-Étienne, France
[email protected]
3
Federal University of Sao Carlos - UFSCar, São Carlos, Brazil
4
Carnegie Mellon University - CMU, Pittsburgh, United States
[email protected]
Abstract. NELL is a system that continuously reads the Web to extract knowledge in form of entities and relations between them. It has
been running since January 2010 and extracted over 50,000,000 candidate
statements. NELL’s generated data comprises all the candidate statements together with detailed information about how it was generated.
This information includes how each component of the system contributed
to the extraction of the statement, as well as when that happened and
how confident the system is in the veracity of the statement. However,
the data is only available in an ad hoc CSV format that makes it difficult
to exploit out of the context of NELL. In order to make it more usable for
other communities, we adopt Linked Data principles to publish a more
standardized, self-describing dataset with rich provenance metadata.
Keywords: NELL, RDF, Semantic Web, Linked Data, Metadata, Reification
1
Introduction
Never-Ending Language Learning (NELL) [3, 16] is an autonomous computational system that aims at continually and incrementally learning. NELL has
been running for about 7 years in Carnegie Mellon University (US). Currently,
NELL has collected over 50 million of candidate beliefs, from with about 3.6
million have been promoted as trustworthy statements. NELL learns from the
web and uses an ontology previously created to guide the learning. One of the
most significant resource contributions of NELL, in addition to the millions of
beliefs learned from the Web, is NELL’s internal representation (or metadata)
for categories, relations and concepts. Such internal representation grows in every iteration, and is used by NELL as a set of different (and constantly updated)
1
Technical Report. November 2017.
feature vectors to continuously retrain NELL’s learning components and build
its own way to understand what is read from the Web. Zimmermann et al. [24]
published in 2013 a solution to convert NELL’s beliefs and ontology into RDF
and OWL. However, NELL’s internal metadata is not modeled in their work.
Thus, the main contribution of this work is to extended the approach to include
all the provenance metadata (NELL’s internal representation) for each belief. We
publish this data using five different representation models: RDF reification [2,
Sec. 5.3], N-Ary relations [19], Named Graphs [5], Singleton Properties [18], and
NdFluents [10]. In addition, we publish not only the promoted beliefs, but also
the candidates. As far as we know, this dataset contains more metadata about
the statements than any other available dataset in the linked data cloud. This
in itself can also be interesting for researchers that seek to manage and exploit
meta-knowledge.
Our intention is to keep this information updated and integrate it on NELL’s
web page5 .
The rest of the paper is organized as follows: Section 2 presents NELL and
the components it comprises; in Section 3 describes the transformation of NELL
data and metadata to RDF; Section 4 presents the dataset generated in this
paper and how it is published; finally, Section 5 provides final remarks and
future work.
2
The Never-Ending Language Learning System
NELL [3, 16] was built based on a new Machine Learning (ML) paradigm, the
Never-Ending Learning (NEL). NEL paradigm is a semi-supervised learning [1]
approach focused on giving the ability to a machine learning system to autonomously use what it has previously learned to continuously become a better
learner. NELL is based on a number of coupled components working in parallel.
These components read the web and use different approaches to, not only infer
new knowledge in the form of beliefs, but also to infer new ways of internally
representing the learned beliefs and their properties. Beliefs are divided into candidates and promoted beliefs. In order to be promoted a belief needs to have a
confidence score of at least 0.9.
1. AliasMatcher finds relations between entities and their Wikipedia URL on
Freebase. It was run only once and is currently not active.
2. CML (Coupled Morphologic Learner) [4] is responsible for identifying morphological regularities (such as that words finished in burg could be cities). It
makes use of orthographic features of noun phrases (e.g., length and number
of words, capitalization, prefixes and suffixes).CMC is the previous version
of this component.
3. CPL (Coupled Pattern Learner) [4] is the component that learns Named Entities (NE) and Textual Patterns (TP) from text in the web pages. Internally,
a different implementation was used between 2010 and 2013 that could learn
5
http://rtw.ml.cmu.edu/
2
Technical Report. November 2017.
4.
5.
6.
7.
8.
9.
10.
11.
12.
categories and relations together. After that, CPL was splitted in CPL1 and
CPL2, the former learning categories and the latter relations, but the distinction is not made in the knowledge base. All the knowledge from CPL1 is
promoted promoted only if CPL2 agrees. i.e., CPL will extract TPs for categories ( is a city, city such as , etc.) and for relations (arg1 is a
city located in arg2, arg1 is the capital of arg2, etc.). Then, using those TPs, CPL will extract NEs for categories (e.g. city(Paris),
city(Annecy), etc.) and NE pairs for relations (locatedIn(Paris, France),
locatedIn(Annecy, France), etc.).
KbManipulation is used to correct some old bugs from NELL’s internal
indexing knowledge. Several of these bugs should be removed automatically,
but NELL has not one automated process for this task yet.
LatLong matches the literal string of Named Entities against a fixed geolocation database.
LE (Learned Embeddings) [23] predicts new categories or relations of
entities based on Event and Named Entity extraction It creates a feature
space where each dimension is a single NELL predicate, and NELL’s learned
NE (or NE pairs for relations) is used as training examples. LE’s process
predicts category or relation for NE (or NE pairs) that were not related in
the training set.
MBL, also known as ErrorBasedIntegrator and Knowledge Integrator, is the
component responsible for taking the decision of promotion based on the
contributions of the other components. EntityResolverCleanup is the name
used for the same MBL process applied during a big alteration in NELL’s
knowledge base. In 2010 a big change was made in the NELL’s KB structure
to make possible for two words to have different meanings (e.g apple the
fruit and Apple the company) and, conversely, for a concept to use different
words (e.g Google and Google Inc.).
OE (Open Eval) [21] queries the web and extract small text using predicate
instances. OE calculates the score based on the text distance between the
instances in a relation.
OntologyModifier is used for any ontology alteration. This component
appears in the Knowledge base when a new seed or and ontology extension
is manually introduced.
PRA (Path Ranking Algorithm) [9] is based on Random Walk Inference.
PRA analyzes the connections between two categories instances which are
the arguments for a relation. This component replaced the old Ruler Learner
component.
RL (Rule Learner) [13] extracts new knowledge using Horn Clauses based
on the ontology. Its implementation was based on FOIL [20]. It can be found
in NELL’s KB, but its execution stopped when NELL started to deal with
polysemy resolution.
SEAL (Coupled Set Expander for Any Language) [22] is the component
responsible for extracting knowledge from HTML patterns. It works in a
similar way to CPL, but using HTML patterns instead of textual patterns.
3
Technical Report. November 2017.
In the past it was called CSEAL, but after some improvements in its performance it changed the name for SEAL.
13. Semparse [12] combines syntactic parsing from CCGbank (a conversion of
the corpus of trees Penn Treebank [15]) and distant supervision.
14. SpreadsheetEdits provides modifications in the NELL’s Knowledge base
using human feedback.
Each of of these components, with the exception of LE, output provenance
information regarding theirs execution. In the next sections we present how this
metadata is modeled in RDF.
3
Converting NELL to RDF
In this section we describe how NELL data and metadata are transformed into
RDF. The first subsection presents how NELL’s ontology and beliefs are converted, following the work by Zimmermann et al. [24]; the second subsection
describes how we convert the provenance metadata associated with each belief.
NELL’s Knowledge bases used in this paper for the promoted and candidates
beliefs are respectively corresponding to the iterations 10756 and 10707. The
code is publicly available in GitHub8 .
3.1
Converting NELL’s beliefs to RDF
NELL’s ontology is published as a file with three tab separated values per line,
where each line expresses a relationship between categories and other categories,
relations, or values used by NELL processes. In order to convert NELL’s ontology
to RDF each line is transformed into a triple as per Zimmermann et al. [24]. In
short, the first and the third values are a pair of categories or relations, or either
a category or relation in the first field and a value in the third. The second field
is a predicate that indicates the relationship between the two elements. The
transformations can be seen in Table 1.
NELL’s beliefs are also published in tab-separated format, where each line
contains a number of fields to express the belief and the associated metadata,
such as iteration of promotion, confidence score, or the activity of the components
that inferred the belief. All the fields except 4, 5, 6, and 13 are used to convert the
beliefs into RDF statements. Table 2 shows the meaning of each field. Fields 1, 2,
and 3 are converted into the subject, predicate, and object of an RDF statement;
the content of fields 7 and 8 create new statements using rdf:label properties;
fields 9 and 10 create new triples with the property skos:prefLabel; finally,
fields 11 and 12 are used to create triples indicating the types of the subject and
the object. For a more detailed description of this step, refer to Zimmermann
et al. [24].
6
7
8
http://rtw.ml.cmu.edu/resources/results/08m/NELL.08m.1075.esv.csv.gz
http://rtw.ml.cmu.edu/resources/results/08m/NELL.08m.1070.cesv.csv.gz
https://github.com/WDAqua/nell2rdf
4
Technical Report. November 2017.
Table 1: NELL’s ontology predicates and their translation in RDFS / OWL (from
[24])
NELL predicate Translation to RDFS / OWL
antireflexive
rdf:type owl:IrreflexiveProperty
antisymmetric
antisymmetric Literal(?object,xsd:boolean)
description
rdfs:comment Literal(?object,@en)
rdfs:domain Class(?object)
domain
domainwithinrange domainWithinRange Literal(?object,xsd:boolean)
generalizations
rdfs:subClassOf Class(?object)
humanformat
humanFormat Literal(?object,xsd:string)
instancetype
instanceType IRI(?object)
inverse
owl:inverseOf ?object
memberofsets
if ?object is rtwcategory then rdf:type rdfs:Class
else ?object is rtwrelation then rdf:type rdf:Property
mutexpredicates
if ?subject is a class then owl:disjointWith ?object
else ?subject is a property then owl:propertyDisjointWith ?object
nrofvalues
if ?object is 1 then rdf:type owl:FunctionalProperty
populate
populate Literal(?object,xsd:boolean)
rdfs:range ?object
range
rangewithindomain rangeWithinDomain Literal(?object,xsd:boolean)
visible
visible Literal(?object,xsd:boolean)
#
1
2
3
4
5
6
7
8
9
10
11
12
13
Table 2: Description of NELL’s beliefs fields
Description
Subject of the belief
Predicate of the belief
Object of the belief
Iteration when the belief was promoted, or a list of iterations
when the components generated the belief
Probability
Confidence score of the belief
Source
MBL activity to promote the belief
Entity literalStrings
Labels of the subject
Value literalStrings
Labels of the object
Best Entity literalString Preferred label of the subject
Best Value literalString Preferred label of the object
Categories for Entity
Classes of the subject
Categories for Value
Classes of the object
Candidate Source
Activity of the components that generated the belief
Field
Entity
Relation
Value
Iteration
5
Technical Report. November 2017.
3.2
Converting NELL metadata to RDF
Fields 4, 5, 6, and 13 of each NELL’s belief are used to extract the metadata.
Each belief is represented by a resource, to which we attach the provenance information. In the promoted beliefs process, field 4 is used to extract the iteration
when the belief was promoted, while field 5 gives a confidence score about it.
On the other hand, in the candidate beliefs process, fields 4 and 5 contains the
iterations when each component generated information about the belief, and the
confidence score provided by each of them. Field 6 contains a summary information about the activity of MBL when processing the promoted belief. The
complete information from field 6 is a summary of field 13. For that reason,
we only process field 13. Finally, in field 13 every activity that took part in
generating the statement is parsed.
The ontology can be seen in Figure 1. We make use of the PROV-O ontology [14] to describe the provenance. Each Belief can be related with one
or more ComponentExecution that, in turn, are performed by a Component.
If the belief is a PromotedBelief, it has attached its iterationOfPromotion
and probabilityOfBelief. The ComponentIteration is related to information
about the process: the iteration, probabilityOfBelief, Token, source and
atTime (the date and time it was processed). The Token expresses the concepts
that the Component is relating. Those concepts can be a pair of entities for a
RelationToken, and entity and a class for a GeneralizationToken (note that
LatLong component has a different token GeoToken, further described later).
Finaly, each component have a source string describing their process for the
belief. This string is then further analyzed and translated into a different set if
IRIs for each type of component in the subsections below.
The classes of the ontology are described in Table 3 and properties of the
ontology are described in Table 4. The classes and properties of each component
are described down below.
Table 3: Description of NELL metadata classes
rdfs:subClassOf
Description
Class
Belief
prov:Entity
A belief
PromotedBelief
Belief
A promoted belief
CandidateBelief
Belief
A candidate belief
ComponentExecution prov:Activity
The activity of a component in an iteration
prov:SoftwareAgent A component
Component
Token
owl:Thing
The tuple that was inferred by the activity
RelationToken
Token
The tuple <Entity,Entity> that was
inferred for a relation
GeneralizationToken Token
The tuple <Entity,Category> that was
inferred for a generalization
GeoToken
Token
The tuple <Entity,Longitude,Latitude>
that was inferred for a geografical belief
6
Fig. 1: NELL2RDF metadata ontology
Technical Report. November 2017.
7
Technical Report. November 2017.
Table 4: Description of NELL metadata properties
rdfs:subPropertyOf
rdfs:domain
rdfs:range
Description
generatedBy
prov:wasGeneratedBy
Belief
ComponentIteration
The Belief was generated by the iteration of the component
associatedWith
prov:wasAssociatedWith ComponentIteration Component
The iteration was performed by the component
iterationOfPromotion owl:DatatypeProperty PromotedBelief
xsd:integer
iteration in which the component was promoted
probabilityOfBelief owl:DatatypeProperty PromotedBelief
xsd:decimal
Confidence score of the Belief
owl:DatatypeProperty ComponentIteration xsd:integer
iteration
Iteration in which a component performed the activity
probability
owl:DatatypeProperty ComponentIteration xsd:decimal
Confidence score given by the component
hasToken
owl:ObjectProperty
ComponentIteration Token
The concepts that the component is relating
source
owl:DatatypeProperty ComponentIteration xsd:string
Data that was used by the component in the activity
atTime
owl:DatatypeProperty ComponentIteration xsd:dateTime
Date and time when the component execution was performed
tokenEntity
owl:DatatypeProperty Token
xsd:string
Entity on which the data was inferred
relationValue
owl:DatatypeProperty RelationToken
xsd:string
Entity related the entity appointed by tokenEntity
generalizationValue owl:DatatypeProperty GeneralizationToken xsd:string
Class of the entity appointed by tokenEntity
Property
8
Technical Report. November 2017.
AliasMatcher execution is denoted by a resource of class AliasMatcherExecution,
and includes the date when the data was extracted from Freebase using the property freebaseDate. The added ontology can be seen in Figure 2.
Fig. 2: AliasMatcherExecution metadata ontology
CMC execution is denoted by a resource of class CMCExecution. A number of
morphological patterns MorphologicalPatternScoreTriple are attached to it,
each one containing a name, a value, and a confidence score. The properties used
can be seen in Table 5, while the ontology diagram is shown in Figure 3.
Table 5: Description of CMC metadata properties
rdfs:domain
rdfs:range
Description
morphologicalPattern
CMCExecution
MorphologicalPatternScoreTriple
One of the morphological patterns used by CMC
morphologicalPatternName MorphologicalPatternScoreTriple xsd:string
Name of the morphological pattern (i.e., prefix, suffix, etc.)
morphologicalPatternValue MorphologicalPatternScoreTriple xsd:string
Value of the morphological pattern (i.e., prefix = Saint and suffix = burgh)
morphologicalPatternScore MorphologicalPatternScoreTriple xsd:decimal
Score of the morphological pattern
Property
CPL execution is denoted by a resource of class CPLExecution. It contains
a series of textual patterns patternOccurrences, each one with a literal that
describes the pattern, and the number of times it has occurred in the NELL’s
data source. The properties used are described in Table 6, and the diagram for
the ontology is shown in Figure 4.
KbManipulation execution is denoted by a resource of class KbManipulationExecution.
Ir contains the bug oldBug that was manually fixed. Its shown in Figure 5.
9
Technical Report. November 2017.
Fig. 3: CMC metadata ontology
Table 6: Description of CPL metadata properties
rdfs:domain
rdfs:range
Description
patternOccurrences CPLExecution
PatternNbOfOccurrencesPair
One of the textual patterns used by CPL
textualPattern
PatternNbOfOccurrencesPair xsd:string
Textual pattern in the form of a sentence
nbOfOccurrences
PatternNbOfOccurrencesPair xsd:nonNegativeInteger
Number of times it has occurred in the NELL’s source data
Property
Fig. 4: CPL metadata ontology
10
Technical Report. November 2017.
Fig. 5: KbManipulation metadata ontology
LatLong execution is denoted by a resource of class LatLongExecution. It contains a list of locations NameLatLongTriple that were used to infer the belief.
Each one containing the name and the latitude and longitude values. This execution has also its own token GeoToken with the latitude and longitude values
reusing the same properties. The properties are detailed in Table 7, and the
ontology diagram is shown in Figure 6.
Table 7: Description of LatLong metadata properties
rdfs:domain
rdfs:range
Property
Description
location
LatLongExecution NameLatLongTriple
One of the locations used by Latlong
name
NameLatLongTriple rdf:langString
Name of the location
latitudeValue NameLatLongTriple xsd:decimal
Latitude of the location
longitudeValue NameLatLongTriple xsd:decimal
Longitude of the location
LE execution is denoted by a resource of class LEExecution. It does not contain
any additional triples.
MBL execution is denoted by a resource of class MBLExecution. It contains the
entities and the categories of the other belief that was used to promote this one.
The properties used are described in Table 8, and the ontology diagram is shown
in Figure 7.
OE execution is denoted by a resource of class OEExecution. It contains a set of
pairs TextUrlPair, each one including the sentence that was used to infer the
belief, and the URL from where it was extracted. The properties used can be
found in Table 9, and the ontology diagram in Figure 8.
11
Technical Report. November 2017.
Fig. 6: LatLong metadata ontology
Table 8: Description of MBL metadata properties
rdfs:domain rdfs:range
Description
promotedEntity
MBLExecution xsd:string
Entity of a belief previously promoted
promotedEntityCategory MBLExecution xsd:string
Category of the entity of the promoted belief
promotedRelation
MBLExecution xsd:string
Relation of the promoted belief
promotedValue
MBLExecution xsd:string
Value of the promoted belief
promotedValueCategory MBLExecution xsd:string
Category of the promoted belief, if applicable
Property
Fig. 7: MBL metadata ontology
12
Technical Report. November 2017.
Table 9: Description of OE metadata properties
Property rdfs:domain rdfs:range
Description
textUrl OEExecution TextUrlPair
One of the pairs <text, url> used by OE
text
TextUrlPair rdf:langString
Text extracted from the web
url
xsd:anyURI
Web page where the text was extracted
Fig. 8: OE metadata ontology
13
Technical Report. November 2017.
OntologyModifier execution is denoted by a resource of class OntologyModifierExecution.
It contains the ontologyModification, which can be either a modification of
a category or a modification of a relation. The ontology diagram can be seen in
Figure 9.
Fig. 9: OntologyModifier metadata ontology
PRA execution is denoted by a resource of class PRAExecution. It includes a
series of Path resources describing the path followed in NELL dataset to infer
the belief. Each Path includes its direction and a confidence score, along with a
list of relations followed. The properties used can be seen in Table 10, while the
ontology diagram is shown in Figure 10.
Table 10: Description of PRA metadata properties
rdfs:domain rdfs:range
Property
Description
relationPath
PRAExecution Path
Relation path that entails the belief
Path
DirectionOfPath
direction
Direction of the path
score
Path
xsd:decimal
Score assigned to the entailment
rdf:List
listOfRelations Path
Ordered list of relations in the path
RL execution is denoted by a resource of class RLExecution. It contains a resource RuleScoresTuple that contains the Rule and a set of scores indicating
the confidence, and the number of beliefs that are estimated to be correctly and
incorrectly inferred (and the number of inferred beliefs for which it is not known
if they are correct or not) with that rule. The rule itself contains the variables
14
Technical Report. November 2017.
Fig. 10: PRA metadata ontology
and their values, and the predicates that are part of it. Each Predicate includes the name of the predicate and the two variables it uses. The complete
list of properties can be found in table 11. The ontology diagram is presented in
Figure 11.
SEAL execution is denoted by a resource of class SEALExecution. It includes
the URL it used with the property url. The ontology diagram can be seen in
Figure 12.
Semparse execution is denoted by a resource of class SemparseExecution. It
includes a literal with the sentence used during it, using the property sentence.
The ontology diagram can be seen in Figure 13.
SpreadsheetEdits execution is denoted by a resource of class SpreadsheetEditsExecution.
It contains a set of literals describing the user who made the modification, the
file used as input, the action made, and the modified entity, relation, and value.
The list of properties can be seen in Table 12, while the ontology diagram is
shown in Figure 14.
4
The NELL2RDF Dataset
The current version of NELL2RDF updates the promoted beliefs to the last
version, adding the provenance triples about them. It also adds the candidate
beliefs and their corresponding provenance triples. We provide the dumps for the
promoted beliefs9 and the candidate beliefs10 . The ontologies for the beliefs11
9
10
11
https://w3id.org/nellrdf/nellrdf.promoted.n3.gz
https://w3id.org/nellrdf/nellrdf.candidates.n3.gz
https://w3id.org/nellrdf/ontology/nellrdf.ontology.n3
15
Technical Report. November 2017.
Table 11: Description of RL metadata properties
rdfs:domain
rdfs:range
Description
ruleScores
RLExecution
RuleScoresTuple
The rule and set of scores used by RL
RuleScoresTuple Rule
rule
The rule RL used to infer the belief, in the form of horn clauses
accuracy
RuleScoresTuple xsd:decimal
Estimated accuracy of the rule in NELL
RuleScoresTuple xsd:nonNegativeInteger
nbCorrect
Estimated number of correct beliefs created by the rule
nbIncorrect
RuleScoresTuple xsd:nonNegativeInteger
Estimated number of incorrect beliefs created by the rule
RuleScoresTuple xsd:nonNegativeInteger
nbUnknown
Number of rules created by the rules with no known correctness
variable
Rule
xsd:string
One of the variables that appear in the rule
valueOfVariable Rule
xsd:string
Value of the variable inferred by the rule
predicate
Rule
Predicate
One of the predicates that appear in the rule
predicateName Predicate
xsd:string
Name of the predicate
firstVariable Predicate
xsd:string
First variable of the predicate
secondVariable Predicate
xsd:string
Second variable of the predicate
Property
Table 12: Description of SpreadsheetEdits metadata properties
rdfs:range
Property rdfs:domain
Description
user
SpreadsheetEditsExecution xsd:string
User that made the modification
SpreadsheetEditsExecution xsd:string
entity
Entity of the belief affected by the modification
relation SpreadsheetEditsExecution xsd:string
Relation of the belief affected by the modification
SpreadsheetEditsExecution xsd:string
value
Value of the belief affected by the modification
action
SpreadsheetEditsExecution xsd:string
Action made in the modification
SpreadsheetEditsExecution xsd:string
file
File where the modification was saved and then read by SpreadsheetEdits
16
Technical Report. November 2017.
Fig. 11: RL metadata ontology
Fig. 12: SEAL metadata ontology
17
Technical Report. November 2017.
Fig. 13: Semparse metadata ontology
Fig. 14: SpreadsheetEdits metadata ontology
18
Technical Report. November 2017.
and the provenance metadata12 is common for both dumps. Metadata about the
dataset13 is modeled using VoID and DCAT vocabularies.
In order to attach the metadata to each belief, we need to reify the statement into a resource. We follow five different models, described down below. A
graphical representation of the models is shown in Figure 15. A summary of the
triples and resources of each model can be seen in Table 13.
– RDF Reification [2, Sec. 5.3] represents the statement using a resource, and
then creates triples to indicate the subject, predicate and object of the statement.
– N-Ary relations [19]: This model creates a new resource that identifies the
relation and connects subject and object using different design patterns.
Wikidata14 makes use of this model of annotation.
– Named Graphs [5]: A forth element is added to each triple, that can be
used to identify a triple or set of triples later on. This model is used by
Nano-publications [17].
– The Singleton Property [18] creates a unique property for each triple, related
to the original one. It defines its own semantics that extend RDF, RDFS.
– NdFluents [10] creates a unique version of the subject and the object (in
the case it is not a literal) of the triple, and attaches them to the original
resources and the context of the statement.
Table 13: Summary of dataset stats for each model
Promoted
Candidates
Total
Model
Size Triples Size Triples Size Triples
W/O metadata 2.99GB 0.02B 162GB 1.45B 165GB 1.48B
RDF Reification 50.9GB 0.24B 776GB 4.50B 827GB 4.74B
N-Ary Relations 50.7GB 0.24B 770GB 4.50B 821GB 4.74B
Named Graphs 49.8GB 0.24B 727GB 4.24B 777GB 4.48B
Singleton Property 49.8GB 0.24B xxxGB x.xxB xxxGB x.xxB
NdFluents 51.3GB 0.25B xxxGB x.xxB xxxGB x.xxB
5
Discussion and Future Work
In this work we present the conversion of both data and metadata from NELL
into RDF. It presents a thesaurus of entities and binary relations between them,
as well as a number of lexicalizations for each entity. It also includes detailed
provenance metadata along with confidence scores, encoded using five different
reification approaches.
12
13
14
https://w3id.org/nellrdf/provenance/ontology/nellrdf.ontology.n3
https://w3id.org/nellrdf/metadata/nellrdf.metadata.n3
https://www.wikidata.org
19
(b) RDF Reification
(c) Singleton Property
(d) Named Graphs
(e) N-Ary Properties
(f) NdFluents
20
Technical Report. November 2017.
(a) Original Triple
Fig. 15: Reification models
Technical Report. November 2017.
Our goals for this dataset are twofold: First, we want to improve WDAquacore0 [6] query answering system, providing it with more relations and lexicalizations, along with confidence scores that can help to give hints about how
trustworthy is the answer. Second, given that it contains a big proportion of
metadata statements, we want to use it as a testbed to compare how the different different metadata representations behave in current triplestores.
While currently we only publish the dumps of the datasets, we plan to provide
SPARQL endpoint and full dereferenceable URLs. In addition, NELL is starting
to be explored in languages different than English, such as Portuguese [7, 11] and
French [8]. Our intention is to convert those datasets to RDF as they become
available to the public, since the system and knowledge base are exactly the
same used in the English one.
Acknowledgements: This work is supported by funding from the EU H2020 research and innovation program under the Marie Sklodowska-Curie grant No
642795. We would like to thank Bryan Kisiel from NELL’s CMU team for the
technical support about NELL’s components.
References
[1] Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with CoTraining. Proceedings of the Eleventh Annual Conference on Computational
Learning Theory (1998)
[2] Brickley, D., Guha, R.: RDF Schema 1.1 - W3c Recommendation. Tech.
Rep. 9780123735560 (2014)
[3] Carlson, A., Betteridge, J., Hruschka, Jr., E.R., Mitchell, T.M.: Coupling
Semi-Supervised Learning of Categories and Relations. Proceedings of the
NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing (2009)
[4] Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, Jr., E.R.,
Mitchell, T.M.: Toward an Architecture for Never-Ending Language Learning. Proceedings of the Twenty-Fourth Conference on Artificial Intelligence
(AAAI) (2010)
[5] Carroll, J.J., Bizer, C., Hayes, P.J., Stickler, P., Ellis, A., Hagino, T.: Named
Graphs, Provenance and Trust. Tech. rep., ACM (2005)
[6] Diefenbach, D., Singh, K., Maret, P.: WDAqua-Core0: A Question Answering Component for the Research Community. ESWC, 7th Open Challenge
on Question Answering over Linked Data (QALD-7) (2017)
[7] Duarte, M.C., Hruschka, Jr., E.R.: How to Read The Web In Portuguese
Using the Never-Ending Language Learner’s Principles. Proceedings of the
14th International Conference on Intelligent Systems Design and Applications (2014)
[8] Duarte, M.C., Maret, P.: Vers une instance française de NELL : chaı̂ne TLN
multilingue et modélisation d’ontologie. Revue des Nouvelles Technologies
de l’Information Extraction et Gestion des Connaissances, RNTI-E-33, 469–
472 (2017)
21
Technical Report. November 2017.
[9] Gardner, M., Talukdar, P.P., Krishnamurthy, J., Mitchell, T.M.: Incorporating Vector Space Similarity in Random Walk Inference over Knowledge
Bases. Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (2014)
[10] Giménez-Garcı́a, J.M., Zimmermann, A., Maret, P.: NdFluents: An Ontology for Annotated Statements with Inference Preservation. Proceedings of
the 14th Extended Semantic Web Conference (ESWC) (2017)
[11] Hruschka, Jr., E.R., Duarte, M.C., Nicoletti, M.C.: Coupling as Strategy
for Reducing Concept-Drift in Never-Ending Learning Environments. Fundamenta Informaticae (1) (2013)
[12] Krishnamurthy, J., Mitchell, T.M.: Joint Syntactic and Semantic Parsing
with Combinatory Categorial Grammar. Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics (ACL) (2014)
[13] Lao, N., Mitchell, T., Cohen, W.W.: Random Walk Inference and Learning in A Large Scale Knowledge Base. Proceedings of the Conference on
Empirical Methods in Natural Language Processing (2011)
[14] Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar,
D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J.: PROV-O: The Prov
Ontology. W3C Recommendation (2013)
[15] Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., Schasberger, B.: The Penn Treebank: Annotating Predicate Argument Structure. Proceedings of the Workshop on Human Language Technology (1994)
[16] Mitchell, T.M., Cohen, W.W., Hruschka, Jr., E.R., Talukdar, P.P., Betteridge, J., Carlson, A., Mishra, B.D., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T., Nakashole, N., Platanios,
E.A., Ritter, A., Samadi, M., Settles, B., Wang, R.C., Wijaya, D.T., Gupta,
A., Chen, X., Saparov, A., Greaves, M., Welling, J.: Never-Ending Learning.
Proceedings of the 29th AAAI Conference on Artificial Intelligence (2015)
[17] Mons, B., Velterop, J.: Nano-Publication in the e-science era. Workshop on
Semantic Web Applications in Scientific Discourse (SWASD) (2009)
[18] Nguyen, V., Bodenreider, O., Sheth, A.: Don’t like RDF Reification?: Making Statements about Statements Using Singleton Property. Proceedings of
the 23rd International Conference on the World Wide Web (WWW) (2014)
[19] Noy, N., Rector, A., Hayes, P., Welty, C.: Defining N-Ary Relations on the
Semantic Web. Tech. rep. (2006)
[20] Quinlan, J.R., Cameron-Jones, R.M.: FOIL: A Midterm Report. Proceedings of the European Conference on Machine Learning (1993)
[21] Samadi, M., Veloso, M.M., Blum, M.: OpenEval: Web Information Query
Evaluation. Proceedings of the 27th AAAI Conference on Artificial Intelligence, July 14-18, 2013, Bellevue, Washington, USA. (2013)
[22] Wang, R.C., Cohen, W.W.: Language-Independent Set Expansion of Named
Entities Using the Web. Proceedings of the 7th IEEE International Conference on Data Mining (2007)
[23] Yang, B., Mitchell, T.M.: Joint Extraction of Events and Entities within
a Document Context. Proceedings of the 2016 Conference of the North
22
Technical Report. November 2017.
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (NAACL HLT) (2016)
[24] Zimmermann, A., Gravier, C., Subercaze, J., Cruzille, Q.: Nell2rdf: Read
the Web, and Turn it into RDF. CEUR Workshop Proceedings (2013)
23