Papers by José M Giménez-García
International Semantic Web Conference, Dec 31, 2022
Following the Linked Data principles means maximising the reusability of data over the Web. Reuse of datasets becomes apparent when datasets are linked to from other datasets and referred to in scientific articles or community discussions. It can thus be measured, similarly to citations of papers. In this paper we propose dataset reuse metrics and use them to analyse indications of dataset reuse in different communication channels within a scientific community. In particular, we consider mailing lists and publications in the Semantic Web community and their correlation with data interlinking. Our results demonstrate that indications of dataset reuse across different communication channels and reuse in terms of data interlinking are positively correlated.
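As a rough illustration of how such a correlation analysis can be set up (the reuse counts below are invented placeholders, and the choice of Spearman rank correlation is an assumption, not necessarily the paper's metric):

```python
# Illustrative sketch: correlating indications of dataset reuse across channels.
# The counts are invented placeholders; the paper's actual metrics and data differ.
from scipy.stats import spearmanr

mentions_in_papers = [120, 45, 80, 5]            # hypothetical mentions in publications
mentions_in_mailing_lists = [200, 60, 150, 10]   # hypothetical mailing-list mentions
incoming_links = [5000, 900, 3000, 40]           # hypothetical interlinking counts

rho_papers, _ = spearmanr(mentions_in_papers, incoming_links)
rho_lists, _ = spearmanr(mentions_in_mailing_lists, incoming_links)
print(f"papers vs. interlinking:        rho = {rho_papers:.2f}")
print(f"mailing lists vs. interlinking: rho = {rho_lists:.2f}")
```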
The current decade is a witness to an enormous explosion of data being published on the Web as Linked Data to maximise its reusability. Answering questions that users speak or write in natural language is an increasingly popular application scenario for Web Data, especially when the questions are not limited to a domain for which dedicated, curated datasets exist, as in medicine. The increasing use of Web Data in this and other settings has highlighted the importance of assessing its quality. While considerable work has been done on assessing the quality of Linked Data, only a few efforts have been dedicated to quality assessment of Linked Data from the question answering (QA) perspective. From the Linked Data quality metrics that have so far been well documented in the literature, we have identified those that are most relevant for QA. We apply these quality metrics, implemented in the Luzzu framework, to subsets of two datasets of crucial importance to open-domain QA, DBpedia and Wikidata, and thus present the first assessment of the quality of these datasets for QA. From these datasets, we assess slices covering the specific domains of restaurants, politicians, films and soccer players. The results of our experiments suggest that for most of these domains, the quality of Wikidata with regard to the majority of relevant metrics is higher than that of DBpedia.
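For a flavour of what a single dataset-level quality metric looks like in practice, here is a minimal sketch (not the Luzzu implementation; the metric definition and the input file are assumptions) that measures how many subjects in a slice carry an owl:sameAs link to another resource:

```python
# Sketch of a simple interlinking-style quality metric over an RDF slice.
# Not the Luzzu implementation; "slice.nt" and the metric definition are assumptions.
from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("slice.nt", format="nt")   # hypothetical domain slice (e.g., politicians)

subjects = set(g.subjects())
linked = {s for s, _, _ in g.triples((None, OWL.sameAs, None))}
ratio = len(linked) / len(subjects) if subjects else 0.0
print(f"subjects with owl:sameAs links: {ratio:.2%}")
```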
Ecosistemas, Dec 28, 2022
Keywords: geospatial data access; linked open data; open forestry data; user interfaces; map visualizations. Exploring massive forestry open data with a web browser. Abstract: Forest Explorer is a web application for easily browsing the contents of Iberian forest inventories and land cover maps. It exploits a source of Linked Open Data that was created in the European project Cross-Forest from the original sources. The application is available at https://forestexplorer.gsic.uva.es/ and can be accessed through a simple web browser on desktop computers, tablets, and mobile devices. The user interface hides the complexity of the underlying technologies, offering an interactive map for navigating to the area of interest and presenting forestry data with the right level of detail. The application supports professional usage as well as more casual use by science communicators, data scientists, or citizens. To date, 8,900 users have employed Forest Explorer and the application has appeared multiple times in the media.
arXiv (Cornell University), Apr 16, 2018
NELL is a system that continuously reads the Web to extract knowledge in the form of entities and relations between them. It has been running since January 2010 and has extracted over 50,000,000 candidate statements. NELL's generated data comprises all the candidate statements together with detailed information about how they were generated. This information includes how each component of the system contributed to the extraction of the statement, as well as when that happened and how confident the system is in the veracity of the statement. However, the data is only available in an ad hoc CSV format that makes it difficult to exploit outside the context of NELL. In order to make it more usable for other communities, we adopt Linked Data principles to publish a more standardized, self-describing dataset with rich provenance metadata.
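A minimal sketch of what such a lifting step can look like with rdflib; the column names, namespace, and PROV-O property below are illustrative assumptions, not the schema of the published dataset:

```python
# Sketch of lifting NELL candidate statements from CSV to RDF with provenance.
# Column names, the namespace, and the PROV-O property are illustrative assumptions.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

NELL = Namespace("http://example.org/nell/")     # hypothetical namespace
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
with open("nell_candidates.csv") as f:           # hypothetical export of candidate statements
    for row in csv.DictReader(f, delimiter="\t"):
        stmt = NELL[f"statement/{row['id']}"]
        g.add((stmt, NELL.subject, NELL[row["entity"]]))
        g.add((stmt, NELL.predicate, NELL[row["relation"]]))
        g.add((stmt, NELL.object, NELL[row["value"]]))
        g.add((stmt, NELL.confidence, Literal(row["confidence"], datatype=XSD.decimal)))
        g.add((stmt, PROV.wasGeneratedBy, NELL[row["component"]]))

g.serialize("nell_candidates.ttl", format="turtle")
```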
Lecture Notes in Computer Science, 2018
With the increasing amount of structured data on the Web, the need to understand and support search over this emerging data space is growing. Adding semantics to structured data can help address existing challenges in data discovery, as it facilitates understanding the values in their context. While there are approaches for lifting structured data to Semantic Web formats to enrich it and facilitate discovery, most work to date focuses on textual fields rather than numerical data. In this paper, we propose a two-level (row- and column-based) approach to add semantic meaning to numerical values in tables, called NUMER. We evaluate our approach using a benchmark (NumDB) generated for the purpose of this work. We show the influence of the different levels of analysis on the success of assigning semantic labels to numerical values in tables. Our approach outperforms the state of the art and is less affected by data structure and quality issues such as a small number of entities or deviations in the data.
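The column-level part of such an approach can be pictured as matching the distribution of a numeric column against candidate knowledge-graph properties; the sketch below uses a Kolmogorov–Smirnov comparison for that matching, which is an assumption for illustration and not NUMER's actual scoring:

```python
# Sketch of column-level semantic labelling for numeric values: pick the candidate
# knowledge-graph property whose value distribution best matches the column.
# The candidate data and the KS-statistic criterion are assumptions, not NUMER itself.
from scipy.stats import ks_2samp

column = [1.70, 1.82, 1.65, 1.91, 1.78]   # unlabeled numeric table column

candidates = {                            # hypothetical property -> sample values from a KG
    "dbo:height": [1.60, 1.75, 1.80, 1.95, 1.68],
    "dbo:populationTotal": [1.2e6, 3.4e5, 8.9e6, 5.0e4, 2.1e5],
}

best = min(candidates, key=lambda prop: ks_2samp(column, candidates[prop]).statistic)
print("best matching property:", best)
```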
Springer eBooks, 2017
RDF provides the means to publish, link, and consume heterogeneous information on the Web of Data, whereas OWL allows the construction of ontologies and the inference of new information that is implicit in the data. Annotating RDF data with additional information, such as provenance, trustworthiness, or temporal validity, is becoming more and more important; however, RDF and OWL can natively represent only binary (or dyadic) relations between entities. While there are some approaches to represent metadata on RDF, they lose most of the reasoning power of OWL. In this paper we present an extension of Welty and Fikes' 4dFluents ontology (which associates temporal validity with statements) to any number of dimensions, provide guidelines and design patterns to implement it on actual data, and compare its reasoning power with alternative representations.
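To make the pattern concrete, here is a small rdflib sketch of the fluents idea of slicing individuals into contextual parts; the namespace and property names (contextualPartOf, contextualExtent) are illustrative stand-ins rather than the exact ontology terms:

```python
# Sketch of the fluents pattern: to annotate "Alice worksFor Acme" with a temporal
# context, create contextual parts of both individuals and assert the relation
# between the parts. Namespace and property names here are illustrative assumptions.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
ND = Namespace("http://example.org/ndfluents#")   # stand-in for the fluents vocabulary

g = Graph()
alice_2020 = EX["Alice_2020"]   # contextual part of Alice
acme_2020 = EX["Acme_2020"]     # contextual part of Acme

g.add((alice_2020, ND.contextualPartOf, EX.Alice))
g.add((acme_2020, ND.contextualPartOf, EX.Acme))
g.add((alice_2020, ND.contextualExtent, EX.Year2020))
g.add((acme_2020, ND.contextualExtent, EX.Year2020))
g.add((alice_2020, EX.worksFor, acme_2020))       # relation holds between the parts

print(g.serialize(format="turtle"))
```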
Lecture Notes in Computer Science, 2020
Data generation in RDF has been increasing over the last years as a means to publish heterogeneous and interconnected data. RDF is usually serialized in verbose text formats, which is problematic for publishing and managing huge datasets. HDT is a binary serialization of RDF that makes use of compact data structures, making it possible to publish and query highly compressed RDF data. This reduces both the volume needed to store the data and the time needed to transfer or query it. However, it moves the burden of dealing with huge amounts of data from the consumer to the publisher, who needs to serialize the text data into HDT. This process consumes a lot of resources in terms of time, processing power, and especially memory. In addition, adding data to a file in HDT format is currently not possible, whether this additional data is in plain text or already serialized into HDT.
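The compactness of HDT rests on a dictionary that maps every RDF term to an integer ID plus a triples component over those IDs; the sketch below illustrates just that encoding idea in plain Python and is a simplification, not the actual HDT data structures:

```python
# Simplified illustration of the dictionary + triples idea behind HDT:
# map every RDF term to an integer ID and store triples as ID tuples.
# This is a conceptual sketch, not the actual HDT bitmap/front-coded structures.
triples = [
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Bob", "ex:worksFor", "ex:Acme"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
]

dictionary = {}                        # term -> integer ID
def term_id(term: str) -> int:
    return dictionary.setdefault(term, len(dictionary) + 1)

encoded = [(term_id(s), term_id(p), term_id(o)) for s, p, o in triples]
print("dictionary:", dictionary)
print("encoded triples:", encoded)     # repeated terms are stored only once
```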
Lecture Notes in Computer Science, 2017
Annotating semantic data with metadata is becoming increasingly important to provide information about the statements. While there are solutions to represent temporal information about a statement, a general annotation framework that allows representing more contextual information is needed. In this paper, we extend the 4dFluents ontology by Welty and Fikes to any dimension of context.
Semantic web, Feb 3, 2022
Forest Explorer is a web tool that can be used to easily browse the contents of the Cross-Forest dataset, a Linked Open Data resource containing the forestry inventory and land cover map of Spain. The tool is intended for domain experts and lay users, to facilitate the exploration of forestry data. Since these two groups are not knowledgeable about Semantic Web technologies, the user interface is designed to hide the complexity of RDF, OWL, or SPARQL. An interactive map is provided for this purpose, allowing users to navigate to the area of interest and presenting forestry data with different levels of detail according to the zoom level. Forest Explorer offers different filter controls and is localized to English and Spanish. All the data is retrieved from the Cross-Forest and DBpedia endpoints through the Data manager. This component feeds the different Feature managers with the data to be displayed in the map. The Data manager uses a reduced set of SPARQL templates to accommodate any data request from the Feature managers. Caching and smart geographic querying are employed to limit data exchanges with the endpoint. A live version of the tool is freely available to everybody who wants to try it; any device with a modern browser is sufficient to test it. Since December 2019, more than 3,200 users have employed Forest Explorer and it has appeared 12 times in the Spanish media. Results from a user study with 28 participants (mainly domain experts) show that Forest Explorer can be used to easily navigate the contents of the Cross-Forest dataset. No important limitations were found, only feature requests, such as the integration of new datasets from other countries, which are part of our future work.
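To illustrate the kind of templated request a Data manager component can issue, here is a hedged sketch with SPARQLWrapper; the endpoint URL, vocabulary, and bounding-box template are assumptions, not the actual Cross-Forest queries:

```python
# Sketch of a templated SPARQL request like those a "Data manager" component might
# send to a Linked Data endpoint. The endpoint URL, vocabulary, and bounding-box
# filter are illustrative assumptions, not the actual Cross-Forest queries.
from SPARQLWrapper import SPARQLWrapper, JSON

TEMPLATE = """
SELECT ?plot ?lat ?long WHERE {{
  ?plot a <http://example.org/forest#Plot> ;
        <http://www.w3.org/2003/01/geo/wgs84_pos#lat>  ?lat ;
        <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?long .
  FILTER (?lat  > {south} && ?lat  < {north} &&
          ?long > {west}  && ?long < {east})
}} LIMIT 100
"""

sparql = SPARQLWrapper("https://example.org/sparql")  # hypothetical endpoint
sparql.setQuery(TEMPLATE.format(south=41.0, north=42.0, west=-5.0, east=-4.0))
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["plot"]["value"], row["lat"]["value"], row["long"]["value"])
```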
Lecture Notes in Computer Science, 2016
While a number of quality metrics have been successfully proposed for datasets in the Web of Data, there is a lack of trust metrics that can be computed for any given dataset. We argue that reuse of data can be seen as an act of trust. In the Semantic Web environment, datasets regularly include terms from other sources, and each of these connections expresses a degree of trust in that source. However, determining what constitutes a dataset in this context is not straightforward. We study the concepts of dataset and dataset link, ultimately using the Pay-Level Domain to differentiate datasets and considering the usage of external terms as connections among them. Using these connections we compute the PageRank value for each dataset, and examine the influence of ignoring predicates in the computation. This process has been performed for more than 300 datasets, extracted from the LOD Laundromat. The results show that reuse of a dataset is not correlated with its size, and provide some insight into the limitations of the approach and ways to improve its efficacy.
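A minimal sketch of the ranking step, assuming datasets have already been reduced to pay-level domains and their term reuse turned into directed links (the nodes and edges below are invented, not LOD Laundromat data):

```python
# Sketch: treat each dataset (identified by pay-level domain) as a node and reuse
# of another dataset's terms as a directed edge, then rank by PageRank.
# Nodes and edges below are invented placeholders, not LOD Laundromat data.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("geonames.org", "dbpedia.org"),   # hypothetical: geonames reuses dbpedia terms
    ("lexvo.org", "dbpedia.org"),
    ("dbpedia.org", "w3.org"),
    ("lexvo.org", "w3.org"),
])

for domain, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {score:.3f}")
```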
Lecture Notes in Computer Science, 2015
HDT is a binary RDF serialization aiming at minimizing the space overheads of traditional RDF formats, while providing retrieval features in compressed space. Several HDT-based applications, such as the recent Linked Data Fragments proposal, leverage these features for diverse publication, interchange and consumption purposes. However, scalability issues emerge in HDT construction because the whole RDF dataset must be processed in a memory-consuming task. This is hindering the evolution of novel applications and techniques at Web scale. This paper introduces HDT-MR, a MapReduce-based technique to process huge RDF datasets and build the HDT serialization. HDT-MR performs in linear time with respect to the dataset size and has proven able to serialize datasets of up to several billion triples, preserving HDT compression and retrieval features.
Data published on the Web is growing every year. However, most of this data does not have a semantic representation. Web tables are an example of structured data on the Web that has no clear semantics. While there is an emerging research effort in lifting tabular data into Semantic Web formats, most of the work is focused on entity recognition in tables with a simple structure. In this work we explore how to capture the semantics of complex tables and transform them into a knowledge graph. These complex tables include contextual information about statements, such as time or provenance. Hence, we need to use contextualized knowledge graphs to represent the information of the tables. We explore how this contextual information is represented in tables, relate it to previous classifications of web tables, and show how to encode it in RDF using different approaches. Finally, we present a prototype tool that converts web tables from Wikipedia into RDF, trying to cover all existing approaches.
arXiv (Cornell University), Sep 18, 2018
HDT (Header, Dictionary, Triples) is a serialization for RDF. HDT has become very popular in the last years because it allows storing RDF data with a small disk footprint while remaining queryable. For this reason HDT is often used when scalability becomes an issue. Once RDF data is serialized into HDT, the disk footprint to store it and the memory footprint to query it are very low. However, generating HDT files from raw text RDF serializations (like N-Triples) is a time-consuming and, especially, memory-consuming task. In this publication we present HDTCat, an algorithm and command-line tool to join two HDT files with a low memory footprint. HDTCat can be used in a divide-and-conquer strategy to generate HDT files from huge datasets with a low memory footprint.
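The key ingredient for joining two already-sorted, dictionary-encoded files with little memory is a streaming merge of sorted sequences; the sketch below shows that idea on plain strings with heapq.merge and is a simplification, not HDTCat's actual algorithm:

```python
# Simplified illustration of the streaming-merge idea behind concatenating two
# sorted, dictionary-encoded files: merge lazily, write once, remove duplicates.
# This operates on plain strings and is not HDTCat's actual algorithm.
import heapq

dict_a = ["ex:Acme", "ex:Alice", "ex:worksFor"]   # sorted terms from file A
dict_b = ["ex:Acme", "ex:Bob", "ex:knows"]        # sorted terms from file B

merged = []
previous = None
for term in heapq.merge(dict_a, dict_b):          # lazy merge, constant extra memory
    if term != previous:
        merged.append(term)
        previous = term

print(merged)   # ['ex:Acme', 'ex:Alice', 'ex:Bob', 'ex:knows', 'ex:worksFor']
```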
arXiv (Cornell University), Sep 14, 2017
We address the problem of providing contextual information about a logical formula (e.g., provenance, date of validity, or confidence) and representing it within a logical system. Doing so requires relying on a higher-order or non-standard formalism, or on some kind of reification mechanism. We explore the case of reification and formalize the concept of contextualizing logical statements in the case of Description Logics. Then, we define several properties of contextualization that are desirable. No previous approach satisfies all of them. Consequently, we define a new way of contextually annotating statements. It is inspired by NdFluents, which is itself an extension of the 4dFluents approach for annotating statements with temporal context. In NdFluents, instances that are involved in a contextual statement are sliced into contextual parts, such that only parts in the same context hold relations to one another, with the goal of better preserving inferences. We generalize this idea by defining contextual parts of relations and classes. This formal construction better satisfies the properties, although not entirely. We show that it is a particular case of a general mechanism that NdFluents also instantiates, and present other variations.
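As an illustrative instance of the slicing pattern (the predicate names follow the fluents vocabulary loosely and are not the paper's exact formalization), the assertion worksFor(alice, acme) annotated with a context c becomes:

```latex
% Illustrative example (not the paper's exact formalization) of contextual slicing:
% the assertion worksFor(alice, acme), annotated with context c, becomes assertions
% over contextual parts alice_c and acme_c.
\begin{align*}
  &\mathrm{worksFor}(\mathit{alice}_c, \mathit{acme}_c) \\
  &\mathrm{contextualPartOf}(\mathit{alice}_c, \mathit{alice}) \qquad
   \mathrm{contextualPartOf}(\mathit{acme}_c, \mathit{acme}) \\
  &\mathrm{contextualExtent}(\mathit{alice}_c, c) \qquad\qquad
   \mathrm{contextualExtent}(\mathit{acme}_c, c)
\end{align*}
```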
The use of RDF to expose semantic data on the Web has seen a dramatic increase over the last few years. Nowadays, RDF datasets are so big and interconnected that classical single-node solutions present significant scalability problems when trying to manage big semantic data. MapReduce, a standard framework for the distributed processing of large quantities of data, is earning a place among the distributed solutions addressing RDF scalability issues. In this article, we survey the most important works addressing RDF management and querying through diverse MapReduce approaches, with a focus on their main strategies, optimizations and results.
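The computation model underlying these approaches can be pictured with a toy, single-process map/shuffle/reduce pass over triples; the example below counts predicate frequencies and only illustrates the pattern, not any of the surveyed systems:

```python
# Toy, single-process illustration of the map/shuffle/reduce pattern over triples:
# count how often each predicate occurs. Not any of the surveyed distributed systems.
from collections import defaultdict

triples = [
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Bob", "ex:worksFor", "ex:Acme"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
]

# Map: emit (predicate, 1) for every triple.
mapped = [(p, 1) for _, p, _ in triples]

# Shuffle: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'ex:worksFor': 2, 'ex:knows': 1}
```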
Lecture Notes in Computer Science, 2020
While RDF was designed to make data easily readable by machines, it does not make data easily usable by end users. Question Answering (QA) over Knowledge Graphs (KGs) is seen as the technology able to bridge this gap. It aims to build systems capable of extracting the answer to a user's natural language question from an RDF dataset. In recent years, many approaches have been proposed that tackle the problem of QA over KGs. Despite such efforts, it is hard and cumbersome to create a Question Answering system on top of a new RDF dataset. The main open challenge remains portability, i.e., the possibility to apply a QA algorithm easily on new and previously untested RDF datasets. In this publication, we address the problem of portability by presenting an architecture for a portable QA system. We present a novel approach called QAnswer KG, which allows the construction of on-demand QA systems over new RDF datasets. Hence, our approach addresses non-expert users in the QA domain. In this paper, we provide the details of the QA system generation process. We show that it is possible to build a QA system over any RDF dataset while requiring minimal investments in terms of training. We run experiments using three different datasets. To the best of our knowledge, we are the first to design such a process for non-expert users. We enable such users to efficiently create an on-demand, scalable, multilingual QA system on top of any RDF dataset.
Ecosistemas
Forest Explorer (Explorador Forestal) is a web application for easily browsing the contents of Iberian forest inventories and forest maps. To do so, it accesses a Linked Open Data source created in the European project Cross-Forest from the original data. The application is available at https://forestexplorer.gsic.uva.es/ and can be accessed with a simple web browser on desktop computers, tablets, and mobile phones. The user interface hides the complexity of the underlying technologies, providing an interactive map for navigating to the area of interest and presenting the forestry data with the appropriate level of detail. The application supports both professional use and more casual use by science communicators, data journalists, or citizens. To date, 8,900 users have employed Forest Explorer and it has appeared multiple times in the media.
The Semantic Web, 2020
While RDF was designed to make data easily readable by machines, it does not make data easily usable by end users. Question Answering (QA) over Knowledge Graphs (KGs) is seen as the technology able to bridge this gap. It aims to build systems capable of extracting the answer to a user's natural language question from an RDF dataset. In recent years, many approaches have been proposed that tackle the problem of QA over KGs. Despite such efforts, it is hard and cumbersome to create a Question Answering system on top of a new RDF dataset. The main open challenge remains portability, i.e., the possibility to apply a QA algorithm easily on new and previously untested RDF datasets. In this publication, we address the problem of portability by presenting an architecture for a portable QA system. We present a novel approach called QAnswer KG, which allows the construction of on-demand QA systems over new RDF datasets. Hence, our approach addresses non-expert users in the QA domain. In this paper, we provide the details of the QA system generation process. We show that it is possible to build a QA system over any RDF dataset while requiring minimal investments in terms of training. We run experiments using three different datasets. To the best of our knowledge, we are the first to design such a process for non-expert users. We enable such users to efficiently create an on-demand, scalable, multilingual QA system on top of any RDF dataset.
Main changes:
- a lot of bug fixes
- added hdtCat to concatenate HDT files
- switched the dictionary to a 64-bit counter to avoid overflow issues with large HDT files
- first (2.x) version to be released on Maven Central since 2014
- updated all dependencies to the most recent versions
- fixed compilation for Java 1.8
- aligned the licenses with what is described in https://github.com/rdfhdt/hdt-java/blob/master/LICENSE
- added minimal Docker support