For many users or automated agents, working with knowledge graphs may be a complicated task. Indeed, multiple tools using knowledge graphs rely on semantics to perform at their best. For example, in the context of data integration, some instance matching tools use semantic features such as functional and inverse functional properties or disjoint classes to discover instances that are the same (or not). Hence, in many cases, conducting an exploratory study is required to discover which semantic features are used or defined in a knowledge graph. In this paper, we propose an ontology and a large-scale ontology-based Web service that provides statistics about the use of OWL 2 and RDFS semantic features (e.g. functional properties or subclasses) in knowledge graphs. This will allow a human or automatic agent to choose the most appropriate tool or data for a given task. It also gives the data publishers a clear picture about the semantics they provide to data consumers. These statistics are represented in the form of an RDF graph (with different serialization possibilities), making them easy to use and share.
For many decades, Business Intelligence and Analytics (BI&A) has been associated with relational databases. In the era of big data and NoSQL stores, it is important to provide approaches and systems capable of analyzing this type of data for decision-making. In this paper, we present a new BI&A approach that both (i) extracts, transforms and loads the required data for OLAP analysis (on-demand ETL) from document stores, and (ii) provides the models and the systems required for suitable OLAP analysis. We focus here on the on-demand ETL stage where, unlike existing works, we consider the dispersion of data over two or more collections.
The number and size of RDF knowledge graphs are constantly increasing. Consequently, processing the data becomes increasingly difficult for agents (automated or human). If several tools can be used for a given task, but each depends to a varying degree on the semantics available in the knowledge graph, then it is important to have an upfront overview of the graph in order to select the best tool for that task. We conducted a large-scale, in-depth study to check for the presence of semantics in the knowledge graphs currently published on the Web of Data (Linked Data). Although some knowledge graphs use OWL 2 semantics, many do not, or do so only partially. We therefore propose an approach that, based on statistics, instantiates an ontology facilitating the selection of the tool best suited to a given task depending on the ...
Information system (IS) quality can be characterized as a multidimensional system. It encompasses software quality as well as data quality. It also comprises model quality, service quality, process quality, and more generally IS quality. Modeling several aspects of IS quality leads to specific ontologies. To the best of our knowledge, there is no global ontology dedicated to all the dimensions of an IS. A single ontology federating all the aspects of quality is not available. The aim of this paper is to propose and discuss the main constituents of an ontology of quality federating all the aspects of IS components quality (software, data, models, etc.). In order to operationalize the proposed ontology, we describe an approach allowing us to use the ontology in order to achieve specific quality goals.
This article describes the problem addressed and the solutions proposed by the QUADRIS project (ARA-05MMSA-0015), whose objective is to provide a framework for evaluating quality in multi-source information systems (SIM). This framework made it possible to define a metamodel for studying, in particular, the interdependencies between the quality dimensions of a conceptual data model and those of the data instantiating that model. We study the possibility of defining quality evaluation patterns in order to: 1) formalize the correlations between quality factors, 2) represent the processes, and 3) analyze the quality of the data and of the system, as well as its evolution. The QUADRIS project has committed to validating its proposals in the following three application domains: biomedical, commercial, and geographic.
This paper presents an approach integrating data quality into the business intelligence chain in the context of CRM applications at EDF (Électricité de France), the major electricity company in France. The main contribution of this paper is the definition and instantiation of a generic multi-dimensional star-like model for storing, analyzing and capitalizing data quality indicators, measurements and metadata. This approach is illustrated in one of EDF's CRM applications implementing the data quality-driven information supply chain for business intelligence where the role of the data quality expert is highly emphasized.
As information systems grow more complex (ubiquitous systems, open enterprises, etc.), many new modeling languages are being proposed. Given this proliferation of specific languages, one may question the quality of the models they produce. This article addresses that problem by drawing lessons from our past experience. These lessons highlight the need for automated tooling to evaluate model quality, the joint participation of the various stakeholders in the evaluation process, and the need to envisage a genuine human-centered engineering of languages and models.
This paper presents an approach integrating data quality into the business intelligence chain in the context of customer-relationship management (CRM) applications at EDF (Électricité de France), the major electricity company in France. The main contribution of this paper is the definition and instantiation of a generic multi-dimensional star-like model for storing, analyzing and capitalizing data quality indicators, measurements and metadata. This approach is illustrated through one of EDF's CRM applications, implementing domain-specific quality indicators and providing quality-driven information management as a business intelligence chain. The role of the data quality expert is highly emphasized.
A large number of definitions of the information system (IS) have been proposed in the literature. It is now well accepted by the communities working with this concept that an information system combines a digital information system with human activities in order to support a company's operational branches, management, and decision-making. Ensuring the quality of the IS is therefore crucial. However, evaluating this quality is a complex, multidimensional task that concerns different resources (processes, data, artifacts, interfaces, ...), considers different characteristics (usability, functionality, ...), and involves different evaluation methods. After several years of experience in evaluating information systems, we summarize in this paper the lessons we have learned from the evaluations we have carried out, and we point out the limits of IS evaluation. We ...
Proceedings of the Ninth International Conference on Enterprise Information Systems
Ensuring and maximizing the quality and integrity of information is a crucial process for today's enterprise information systems (EIS). It requires a clear understanding of the interdependencies between the dimensions characterizing quality of data (QoD), quality of the conceptual data model (QoM) of the database, keystone of the EIS, and quality of data management and integration processes (QoP). The improvement of one quality dimension (such as data accuracy or model expressiveness) may have negative consequences on other quality dimensions (e.g., freshness or completeness of data). In this paper we briefly present a framework, called QUADRIS, for adopting a quality improvement strategy on one or many dimensions of QoD or QoM while considering the collateral effects on the other interdependent quality dimensions. We also present the scenarios of our ongoing validations on a CRM EIS.
The quality of a Knowledge Graph (also known as Linked Data) is an important aspect to indicate its fitness for use in an application. Several quality dimensions are identified, such as accuracy, completeness, timeliness, provenance, and accessibility, which are used to assess the quality. While many prior studies offer a landscape view of data quality dimensions, here we focus on presenting a systematic literature review for assessing the completeness of Knowledge Graphs. We gather existing approaches from the literature and analyze them qualitatively and quantitatively. In particular, we unify and formalize commonly used terminologies across 56 articles related to the completeness dimension of data quality and provide a comprehensive list of methodologies and metrics used to evaluate the different types of completeness. We identify seven types of completeness, including three types that were not identified in previous surveys. We also analyze nine different tools capable of assessing Knowledge Graph completeness. The aim of this Systematic Literature Review is to provide researchers and data curators with a comprehensive and deeper understanding of existing works on completeness and its properties, thereby encouraging further experimentation and development of new approaches focused on completeness as a data quality dimension of Knowledge Graphs.
Digital music scores are a way to present music notation, but they lack the semantic information needed to manipulate music concepts for musicological purposes. We propose a general approach to extend score encodings with semantic annotations. It relies on an ontology of music notation designed to integrate semantic music elements either extracted or produced by a knowledge process. We illustrate the whole mechanism by extracting RDF facts based on the identification of dissonances in Renaissance counterpoint.
The accuracy and relevance of Business Intelligence & Analytics (BI&A) rely on the ability to bring high data quality to the data warehouse from both internal and external sources using the ETL process. The latter is complex and time-consuming as it manages data with heterogeneous content and diverse quality problems. Ensuring data quality requires tracking quality defects along the ETL process. In this paper, we present the main ETL quality characteristics. We provide an overview of the existing ETL process data quality approaches. We also present a comparative study of some commercial ETL tools to show how much these tools consider data quality dimensions. To illustrate our study, we carry out experiments using an ETL dedicated solution (Talend Data Integration) and a data quality dedicated solution (Talend Data Quality). Based on our study, we identify and discuss quality challenges to be addressed in our future research.
With an increase in the number of Linked Open Data datasets, insufficient interlinking quality can lead to a decrease in overall data quality. Therefore, it is necessary to keep the interlinking quality as high as possible. One of the main ways to link datasets is to use owl:sameAs links, i.e. to indicate that two things are the same. But given its strict semantics, owl:sameAs is often misused in the wild. Indeed, identity is often relative and depends on the context of use. We therefore propose an approach that takes the characteristics of the involved datasets into account when interlinking them with owl:sameAs statements. Experiments performed on real-world datasets show that the proposed approach is promising.
The dependence of companies and organizations on their information systems (IS) no longer needs to be demonstrated. This reality leads decision-makers to ensure an acceptable level of quality for their information systems. One of the main characteristics of the quality of these systems is its multidimensional nature. Stylianou and Kumar [STY 00] characterize information system quality by means of six dimensions: infrastructure, software, data, information, administration, and service. In particular, Stylianou distinguishes the quality of the data entering the IS from the quality of the information it outputs. This nuance is not taken up in other approaches, which do not differentiate between data and information in terms of quality.
Papers by Samira Cherfi