An ontology is a machine-processable artifact that captures knowledge about some domain of interest. Ontologies are used in various domains including healthcare, science, and commerce. In this paper we examine the ontology bootstrapping problem. Specifically, we look at an approach that uses both competency questions and knowledge source reuse via recommendations to address the "cold start problem", that is, the task of creating an ontology from scratch. We describe this approach and an implementation of it, and we present an evaluation in the form of a controlled user study. We find that the approach leads users to create significantly more detailed initial ontologies that have greater domain coverage than ontologies produced without this support. Furthermore, in spite of a more involved workflow, the usability and user satisfaction of the bootstrapping approach are as good as those of a state-of-the-art ontology editor with no additional support.
The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this paper, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
We present Snap-SPARQL, a Java framework for working with SPARQL and OWL. The framework includes a parser, axiom template API, SPARQL algebra implementation, and graphical user interface components for reading, processing and executing SPARQL queries under the SPARQL 1.1 OWL Entailment Regime. While the framework was originally designed to support the implementation of a SPARQL teaching aid in the form of a Protégé plugin, we believe that it is more generally useful and may be of interest to developers and researchers working on SPARQL 1.1 OWL Entailment Regime implementations and optimisations. The framework is open source and pluggable.
Ontologies are complex intellectual artifacts and creating them requires significant expertise and effort. While existing ontology-editing tools and methodologies propose ways of building ontologies in a normative way, empirical investigations of how experts actually construct ontologies "in the wild" are rare. Yet, understanding actual user behavior can play an important role in the design of effective tool support. Although previous empirical investigations have produced a series of interesting insights, they were exploratory in nature and aimed at gauging the problem space only. In this work, we aim to advance the state of knowledge in this domain by systematically defining and comparing a set of hypotheses about how users edit ontologies. Towards that end, we study the user editing trails of four real-world ontology-engineering projects. Using a coherent research framework, called HypTrails, we derive formal definitions of hypotheses from the literature, and systematically compare them with each other. Our findings suggest that the hierarchical structure of an ontology exerts the strongest influence on user editing behavior, followed by the entity similarity and the semantic distance of classes in the ontology. Moreover, these findings are strikingly consistent across all ontology-engineering projects in our study, with only minor exceptions for one of the smaller datasets. We believe that our results are important for ontology tool builders and for project managers, who can potentially leverage this information to create user interfaces and processes that better support the observed editing patterns of users.
The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity: there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly represented metadata is to normalize the multiple, distinct field names used in metadata (e.g., lat lon, lat and long) to describe the same type of value. To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more and substantially better alignments between metadata and ontology terms.
International Journal of Human-Computer Studies, Dec 1, 2015
With the growing popularity of large-scale collaborative ontology-engineering projects, such as the creation of the 11th revision of the International Classification of Diseases, we need new methods and insights to help project and community managers cope with the constantly growing complexity of such projects. In this paper, we present a novel application of Markov chains to model sequential usage patterns that can be found in the change logs of collaborative ontology-engineering projects. We provide a detailed presentation of the analysis process, describing all the steps that are necessary to apply and determine the best-fitting Markov chain model. Among other things, the model and results allow us to identify structural properties and regularities as well as predict future actions based on usage sequences. We are specifically interested in determining the appropriate Markov chain order, which specifies how many previous actions future ones depend on. To demonstrate the practical usefulness of the extracted Markov chains, we conduct sequential pattern analyses on a large-scale collaborative ontology-engineering dataset, the International Classification of Diseases in its 11th revision. To further expand on the usefulness of the presented analysis, we show that the collected sequential patterns provide potentially actionable information for user-interface designers, ontology-engineering tool developers and project managers to monitor, coordinate and dynamically adapt to the natural development processes that occur when collaboratively engineering an ontology. We hope that the presented work will spur a new line of ontology-development tools, evaluation techniques and new insights, further taking the interactive nature of the collaborative ontology-engineering process into consideration.
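As a rough illustration of the modelling idea in this abstract, the sketch below estimates the transition probabilities of a Markov chain of a chosen order from action sequences by simple counting. It is a minimal sketch, not the authors' implementation: the function names `fit_markov_chain` and `predict_next`, the action labels, and the toy change logs are all hypothetical, and the step of selecting the best-fitting order (which the paper treats in detail) is not shown.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequences, order=1):
    """Maximum-likelihood transition probabilities for a chain of the given order.

    Each history is a tuple of the `order` most recent actions.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            history = tuple(seq[i - order:i])
            counts[history][seq[i]] += 1
    model = {}
    for history, nexts in counts.items():
        total = sum(nexts.values())
        model[history] = {action: n / total for action, n in nexts.items()}
    return model


def predict_next(model, recent_actions):
    """Most probable next action after the given history, or None if unseen."""
    dist = model.get(tuple(recent_actions))
    return max(dist, key=dist.get) if dist else None


# Hypothetical change-log sessions; the action names are made up for illustration.
change_logs = [
    ["edit", "move", "edit", "move", "edit"],
    ["create", "edit", "move", "edit"],
    ["edit", "edit", "move"],
]

first_order = fit_markov_chain(change_logs, order=1)
second_order = fit_markov_chain(change_logs, order=2)

print(first_order[("edit",)])                        # distribution after "edit"
print(predict_next(second_order, ["edit", "move"]))  # uses the two latest actions
```

A higher order conditions on a longer history but needs many more observations per history, which is why choosing the order is a model-selection question rather than a matter of always picking the largest value.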
The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed, the CEDAR Workbench, is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.
One of the original motivations behind ontology research was the belief that ontologies can help with reuse in knowledge representation. However, many of the ontologies that are developed with reuse in mind, such as standard reference ontologies and controlled terminologies, are extremely large, while users often need to reuse only a small part of these resources in their work. Specifying various views of an ontology enables users to limit the set of concepts that they see. In this paper, we develop the concept of a Traversal View, a view where a user specifies the central concept or concepts of interest, the relationships to traverse to find other concepts to include in the view, and the depth of the traversal. For example, given a large ontology of anatomy, a user may use a Traversal View to extract the concept Heart together with the organs and organ parts that surround the heart or are contained in it. We define the notion of Traversal Views formally, discuss their properties, present a strategy for maintaining the view through ontology evolution and describe our tool for defining and extracting Traversal Views.
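The three ingredients named in this abstract (starting concepts, relationships to follow, traversal depth) can be illustrated with a bounded breadth-first traversal. This is a simplified sketch under stated assumptions, not the paper's formal definition: the function `traversal_view`, the adjacency-list representation, the relation names, and the toy anatomy fragment are all hypothetical, and a single depth limit is applied to every relationship.

```python
from collections import deque


def traversal_view(graph, start_concepts, relations, max_depth):
    """Collect concepts reachable from the start concepts by following only
    the given relation types, at most max_depth hops away."""
    view = set(start_concepts)
    frontier = deque((concept, 0) for concept in start_concepts)
    while frontier:
        concept, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth limit reached; do not expand this concept further
        for relation, target in graph.get(concept, []):
            if relation in relations and target not in view:
                view.add(target)
                frontier.append((target, depth + 1))
    return view


# Hypothetical anatomy fragment: concept -> list of (relation, target concept).
anatomy = {
    "Heart": [
        ("has_part", "Left ventricle"),
        ("has_part", "Right atrium"),
        ("surrounded_by", "Pericardium"),
        ("supplied_by", "Coronary artery"),
    ],
    "Left ventricle": [("has_part", "Mitral valve")],
    "Pericardium": [("adjacent_to", "Lung")],
}

view = traversal_view(anatomy, ["Heart"], {"has_part", "surrounded_by"}, max_depth=1)
deeper_view = traversal_view(anatomy, ["Heart"], {"has_part", "surrounded_by"}, max_depth=2)
print(sorted(view))
print(sorted(deeper_view))
```

Breadth-first order means each concept is first reached along a shortest path, so the depth limit is applied consistently regardless of the order in which edges are listed.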
Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semistructured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials.
The emergence of the FAIR principles is driving renewed efforts in the biomedical community to produce high-quality metadata that describe datasets submitted to public repositories. A variety of organizations are now involved in developing submission pipelines that place a strong emphasis on accompanying submissions with highly descriptive metadata. However, these pipelines have highly variable requirements, which range from using ontology-based metadata in existing submission pipelines to supporting end-to-end metadata management in new pipelines. There is a lack of tools for integrating metadata support when building these pipelines. In this paper we describe a system called CEDAR that aims to address this challenge. The described tools provide a flexible, highly configurable solution for producing submission workflows with semantically rich metadata support. We outline how we have used these tools to deliver robust metadata submission pipelines for several communities, including the Adaptive Immune Receptor Repertoire (AIRR), the NIH Cloud Credits Model Pilot (CCP), and the Library of Integrated Network-based Cellular Signatures (LINCS).
It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets, both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.
Ontologies in the biomedical domain are numerous, highly specialized and very expensive to develop. Thus, a crucial prerequisite for ontology adoption and reuse is effective support for exploring and finding existing ontologies. Towards that goal, the National Center for Biomedical Ontology (NCBO) has developed BioPortal, an online repository designed to support users in exploring and finding more than 500 existing biomedical ontologies. As of 2016, BioPortal is one of the largest portals for exploration of semantic biomedical vocabularies and terminologies, and it is used by many researchers and practitioners. While usage of this portal is high, we know very little about how exactly users search and explore ontologies and what kinds of usage patterns or user groups exist in the first place. Deeper insights into user behavior on such portals can provide valuable information for devising strategies that better support users in exploring and finding existing ontologies, and thereby enable better ontology reuse. To that end, we study and group users according to their browsing behavior on BioPortal using data mining techniques. Additionally, we use the obtained groups to characterize and compare exploration strategies across ontologies. In particular, we were able to identify seven distinct browsing-behavior types, which all make use of different functionality provided by BioPortal. For example, Search Explorers make extensive use of the search functionality while Ontology Tree Explorers mainly rely on the class hierarchy to explore ontologies. Further, we show that specific characteristics of ontologies influence the way users explore and interact with the website. Our results may guide the development of more user-oriented systems for ontology exploration on the Web.
Motivation: Schema.org is an initiative by major Web search engines to define a common vocabulary for structuring Web content from a variety of domains, promoting data interoperability and enabling Web content to benefit from sophisticated search services. Within the wide spectrum of schema.org vocabulary, there are specialized data attributes for biomedical objects. Before leveraging these attributes to mark up the actual data, it is valuable for biomedical data publishers to know which of their key data fields can be captured by schema.org. There are currently no quantitative evaluations to measure how much of schema.org vocabulary aligns with the accepted standards in biomedical domains. In this paper, we provide such an evaluation against selected biomedical standards for drugs, clinical trials and medical datasets.
While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 publicly available biomedical linked data graphs into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
Pinterest is a popular Web application that has over 250 million active users. It is a visual discovery engine for finding ideas for recipes, fashion, weddings, home decoration, and much more. In the last year, the company adopted Semantic Web technologies to create a knowledge graph that aims to represent the vast amount of content and users on Pinterest, to help both content recommendation and ads targeting. In this paper, we present the engineering of an OWL ontology, the Pinterest Taxonomy, that forms the core of Pinterest's knowledge graph, the Pinterest Taste Graph. We describe modeling choices and enhancements to WebProtégé that we used for the creation of the ontology. In two months, eight Pinterest engineers, without prior experience of OWL and WebProtégé, revamped an existing taxonomy of noisy terms into an OWL ontology. We share our experience and present the key aspects of our work that we believe will be useful for others working in this area.
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
An ontology is a machine processable artifact that captures knowledge about some domain of intere... more An ontology is a machine processable artifact that captures knowledge about some domain of interest. Ontologies are used in various domains including healthcare, science, and commerce. In this paper we examine the ontology bootstrapping problem. Specifically, we look at an approach that uses both competency questions and knowledge source reuse via recommendations to address the "cold start problem" that is, the task of creating an ontology from scratch. We describe this approach, an implementation of it, and we present an evaluation in the form of a controlled user study. We find that the approach leads users into creating significantly more detailed initial ontologies that have a greater domain coverage than ontologies produced without this support. Furthermore, in spite of a more involved workflow, the usability and user satisfaction of the bootstrapping approach is as good as a state-of-the-art ontology editor with no additional support.
The availability of high-quality metadata is key to facilitating discovery in the large variety o... more The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this paper, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
We present Snap-SPARQL, which is a Java framework for working with SPARQL and OWL. The framework ... more We present Snap-SPARQL, which is a Java framework for working with SPARQL and OWL. The framework includes a parser, axiom template API, SPARQL algebra implementation, and graphical user interface components for reading, processing and executing SPARQL queries under the SPARQL 1.1 OWL Entailment Regime. While the framework was originally designed to support the implementation of a SPARQL teaching aid in the form of a Protégé plugin, we believe that it is more generally useful and may be of interest to developers and researchers working on SPARQL 1.1 OWL entailment regime implementations and optimisations. The framework is open source and pluggable.
Ontologies are complex intellectual artifacts and creating them requires significant expertise an... more Ontologies are complex intellectual artifacts and creating them requires significant expertise and effort. While existing ontologyediting tools and methodologies propose ways of building ontologies in a normative way, empirical investigations of how experts actually construct ontologies "in the wild" are rare. Yet, understanding actual user behavior can play an important role in the design of effective tool support. Although previous empirical investigations have produced a series of interesting insights, they were exploratory in nature and aimed at gauging the problem space only. In this work, we aim to advance the state of knowledge in this domain by systematically defining and comparing a set of hypotheses about how users edit ontologies. Towards that end, we study the user editing trails of four real-world ontologyengineering projects. Using a coherent research framework, called Hyp-Trails, we derive formal definitions of hypotheses from the literature, and systematically compare them with each other. Our findings suggest that the hierarchical structure of an ontology exercises the strongest influence on user editing behavior, followed by the entity similarity, and the semantic distance of classes in the ontology. Moreover, these findings are strikingly consistent across all ontology-engineering projects in our study, with only minor exceptions for one of the smaller datasets. We believe that our results are important for ontology tools builders and for project managers, who can potentially leverage this information to create user interfaces and processes that better support the observed editing patterns of users.
The metadata about scientific experiments published in online repositories have been shown to suf... more The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity-there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly-represented metadata is to normalize the multiple, distinct field names used in metadata (e.g., lat lon, lat and long) to describe the same type of value. To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more and substantially better alignments between metadata and ontology terms.
International journal of human-computer studies, Dec 1, 2015
With the growing popularity of large-scale collaborative ontologyengineering projects, such as th... more With the growing popularity of large-scale collaborative ontologyengineering projects, such as the creation of the 11 th revision of the International Classification of Diseases, we need new methods and insights to help project-and community-managers to cope with the constantly growing complexity of such projects. In this paper, we present a novel application of Markov chains to model sequential usage patterns that can be found in the change-logs of collaborative ontology-engineering projects. We provide a detailed presentation of the analysis process, describing all the required steps that are necessary to apply and determine the best fitting Markov chain model. Amongst others, the model and results allow us to identify structural properties and regularities as well as predict future actions based on usage sequences. We are specifically interested in determining the appropriate Markov chain orders which postulate on how many previous actions future ones depend on. To demonstrate the practical usefulness of the extracted Markov chains we conduct sequential pattern analyses on a large-scale collaborative ontology-engineering dataset, the International Classification of Diseases in its 11 th revision. To further expand on the usefulness of the presented analysis, we show that the collected sequential patterns provide potentially actionable information for user-interface designers, ontology-engineering tool developers and project-managers to monitor, coordinate and dynamically adapt to the natural development processes that occur when collaboratively engineering an ontology. We hope that presented work will spur a new line of ontology-development tools, evaluation-techniques and new insights, further taking the interactive nature of the collaborative ontology-engineering process into consideration.
The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that ... more The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed¾the CEDAR Workbench¾is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, RESTbased environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.
One of the original motivations behind ontology research was the belief that ontologies can help ... more One of the original motivations behind ontology research was the belief that ontologies can help with reuse in knowledge representation. However, many of the ontologies that are developed with reuse in mind, such as standard reference ontologies and controlled terminologies, are extremely large, while the users often need to reuse only a small part of these resources in their work. Specifying various views of an ontology enables users to limit the set of concepts that they see. In this paper, we develop the concept of a Traversal View, a view where a user specifies the central concept or concepts of interest, the relationships to traverse to find other concepts to include in the view, and the depth of the traversal. For example, given a large ontology of anatomy, a user may use a Traversal View to extract a concept of Heart and organs and organ parts that surround the heart or are contained in the heart. We define the notion of Traversal Views formally, discuss their properties, present a strategy for maintaining the view through ontology evolution and describe our tool for defining and extracting Traversal Views.
Metadata that are structured using principled schemas and that use terms from ontologies are esse... more Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semistructured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, metanalyses, and matching of eligible patients to trials.
The emergence of the FAIR principles is driving renewed efforts in the biomedical community to pr... more The emergence of the FAIR principles is driving renewed efforts in the biomedical community to produce high-quality metadata that describe datasets submitted to public repositories. A variety of organizations are now involved in developing submission pipelines that place a strong emphasis on accompanying submissions with highly descriptive metadata. However, these pipelines have highly variable requirements, which range from using ontology-based metadata in existing submission pipelines to supporting end-to-end metadata management in new pipelines. There is a lack of tools for integrating metadata support when building these pipelines. In this paper we describe a system called CEDAR that aims to address this challenge. The described tools provide a flexible, highly configurable solution for producing submission workflows with semantically rich metadata support. We outline how we have used these tools to deliver robust metadata submission pipelines for several communities, including the Adaptive Immune Receptor Repertoire (AIRR), the NIH Cloud Credits Model Pilot (CCP), and the Library of Integrated Network-based Cellular Signatures (LINCS).
It is challenging to determine whether datasets are findable, accessible, interoperable, and reus... more It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be "rich" and to adhere to "domain-relevant" community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these "rich," discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets-both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.
Ontologies in the biomedical domain are numerous, highly specialized and very expensive to develop. Thus, a crucial prerequisite for ontology adoption and reuse is effective support for exploring and finding existing ontologies. Towards that goal, the National Center for Biomedical Ontology (NCBO) has developed BioPortal, an online repository designed to support users in exploring and finding more than 500 existing biomedical ontologies. In 2016, BioPortal represents one of the largest portals for exploration of semantic biomedical vocabularies and terminologies, which is used by many researchers and practitioners. While usage of this portal is high, we know very little about how exactly users search and explore ontologies and what kind of usage patterns or user groups exist in the first place. Deeper insights into user behavior on such portals can provide valuable information to devise strategies for better support of users in exploring and finding existing ontologies, and thereby enable better ontology reuse. To that end, we study and group users according to their browsing behavior on BioPortal using data mining techniques. Additionally, we use the obtained groups to characterize and compare exploration strategies across ontologies. In particular, we were able to identify seven distinct browsing-behavior types, which all make use of different functionality provided by BioPortal. For example, Search Explorers make extensive use of the search functionality while Ontology Tree Explorers mainly rely on the class hierarchy to explore ontologies. Further, we show that specific characteristics of ontologies influence the way users explore and interact with the website. Our results may guide the development of more user-oriented systems for ontology exploration on the Web.
Motivation: Schema.org is an initiative by major Web search engines to define a common vocabulary for structuring Web content from a variety of domains, promoting data interoperability and enabling Web content to benefit from sophisticated search services. Within the wide spectrum of schema.org vocabulary, there are specialized data attributes for biomedical objects. Before leveraging these attributes to mark up the actual data, it is valuable for biomedical data publishers to know which of their key data fields can be captured by schema.org. There are currently no quantitative evaluations to measure how much of schema.org vocabulary aligns with the accepted standards in biomedical domains. In this paper, we provide such an evaluation against selected biomedical standards for drugs, clinical trials and medical datasets.
While the biomedical community has published several "open data" sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 publicly available biomedical linked data graphs into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.
Pinterest is a popular Web application that has over 250 million active users. It is a visual discovery engine for finding ideas for recipes, fashion, weddings, home decoration, and much more. In the last year, the company adopted Semantic Web technologies to create a knowledge graph that aims to represent the vast amount of content and users on Pinterest, to help both content recommendation and ads targeting. In this paper, we present the engineering of an OWL ontology, the Pinterest Taxonomy, that forms the core of Pinterest's knowledge graph, the Pinterest Taste Graph. We describe modeling choices and enhancements to WebProtégé that we used for the creation of the ontology. In two months, eight Pinterest engineers, without prior experience of OWL and WebProtégé, revamped an existing taxonomy of noisy terms into an OWL ontology. We share our experience and present the key aspects of our work that we believe will be useful for others working in this area.
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Papers by Mark Musen