10 1016j Giq 2016 02 001

GOVINF-01153; No.
of pages: 13; 4C:

Government Information Quarterly xxx (2016) xxx–xxx
Contents lists available at ScienceDirect
Government Information Quarterly
journal homepage: www.elsevier.com/locate/govinf
Open data quality measurement framework: Definition and application

to Open Government Data
Antonio Vetrò ⁎, Lorenzo Canova, Marco Torchiano, Camilo Orozco Minotas,
Raimondo Iemma, Federico Morando
Nexa Center for Internet & Society, DAUIN, Politecnico di Torino, Italy
a r t i c l e i n f o a b s t r a c t
Article history: The diffusion of Open Government Data (OGD) in recent years kept a very fast pace. However, evidence from
Received 5 March 2015 practitioners shows that disclosing data without proper quality control may jeopardize dataset reuse and nega-
Received in revised form 1 February 2016 tively affect civic participation. Current approaches to the problem in literature lack a comprehensive theoretical
Accepted 3 February 2016
framework. Moreover, most of the evaluations concentrate on open data platforms, rather than on datasets.
Available online xxxx
In this work, we address these two limitations and set up a framework of indicators to measure the quality of
Keywords:
Open Government Data on a series of data quality dimensions at most granular level of measurement. We vali-
Open Government Data dated the evaluation framework by applying it to compare two cases of Italian OGD datasets: an internationally
Open data quality recognized good example of OGD, with centralized disclosure and extensive data quality controls, and samples of
Government information quality OGD from decentralized data disclosure (municipality level), with no possibility of extensive quality controls as
Data quality measurement in the former case, hence with supposed lower quality.
Empirical assessment Starting from measurements based on the quality framework, we were able to verify the difference in quality: the
measures showed a few common acquired good practices and weaknesses, and a set of discriminating factors
that pertain to the type of datasets and the overall approach. On the basis of this evaluation, we also provided
technical and policy guidelines to overcome the weaknesses observed in the decentralized release policy, ad-
dressing specific quality aspects.
© 2016 Elsevier Inc. All rights reserved.
1. Introduction actors, ranging from companies to Non-Governmental Organizations,

from developers to simple citizens. Many suggest that wider and easier
Open data are data that “can be freely used, modified, and shared by circulation of public datasets could entail interesting (and even unex-
anyone for any purpose” (Web ref. 1). Compared to proprietary frame- pected) forms of reuse, also for commercial purposes (Vickery, 2011),
works, digital commons such as open data are characterized — from and in general improve transparency of public institutions (Stiglitz,
both a legal and a technical point of view — by lower restrictions applied Orszag, & Orszag, 2000; Ubaldi, 2013) and distributed ability to inter-
to their circulation and reuse. This feature is supposed to ultimately fos- pret complex phenomena (Janssen, Charalabidis, & Zuiderwijk, 2012).
ter collaboration, creativity and innovation (Hofmokl, 2010). From the point of view of data reusers, we should take into account
At all administrative levels, the public sector is one of the major pro- the role of the so-called infomediaries, i.e., players that are able to inter-
ducers and holders of information, which ranges, e.g., from maps to pret data and present them effectively to the general public (Mayer-
companies registers (Aichholzer & Burkert, 2004). During the last Schönberger & Zappia, 2011), with a highly diversified set of business
years, the amount and variety of open data released by public adminis- models (Zuiderwijk, Janssen, & Davis, 2014). More in general, open
trations across the world has been tangibly growing (see, e.g., the Open data consumers manipulate data in many ways, ranging, e.g., from
Data Census by the Open Knowledge Foundation (Web ref. 2)), while data integration to classification, also depending on the complementary
increased political awareness on the subject has been translated in assets they hold (Ferro & Osella, 2013). Considering this, legal and
regulation, including the revised of the EU Directive on Public Sector technical openness of datasets is not sufficient, by itself, to create a pro-
Information reuse in 2013, as well as national roadmaps and technical lific reuse ecosystem (Helbig, Nakashima, & Dawe, 2012): failures in
guidelines. Releasing public sector information as open data can provide providing good quality information might impair not only the reuse of
considerable added value, meeting a demand coming from all kinds of the data, but also the usage of the institutional portals (Detlor, Hupfer,
Ruhi, & Zhao, 2013). Attempts to increase meaningfulness and reusabil-
⁎ Corresponding author at: Nexa Center for Internet & Society, Politecnico di Torino, Via
ity of public sector information also imply representing and exposing
Pier Carlo Boggio, 65/A, 10138 Torino, Italy. data so that they can be easily accessed, queried, processed and linked
E-mail address: [email protected] (A. Vetrò). with other data with no restrictions (Sharon, 2010). Although sharing
http://dx.doi.org/10.1016/j.giq.2016.02.001
0740-624X/© 2016 Elsevier Inc. All rights reserved.
Please cite this article as: Vetrò, A., et al., Open data quality measurement framework: Definition and application to Open Government Data,
Government Information Quarterly (2016), http://dx.doi.org/10.1016/j.giq.2016.02.001
2 A. Vetrò et al. / Government Information Quarterly xxx (2016) xxx–xxx
common functionalities, the most used softwares for open data publica- set of metrics. Then, we apply the framework (Section 4) to observe dif-
tion may adopt different approaches to ensure the above (Iemma, ferences in quality between OGD released at national level (obtained in-
Morando, & Osella, 2014). tegrating data from different regions, with presumably higher data
On top of these considerations, it is necessary to consider that low- quality), and OGD at municipality level and therefore characterized
quality data provision increases the cost (in its wider meaning) of by decentralized disclosure (and presumably lower data quality). We
accessing and interpreting data: it is this cost and not the poor quality package the results (Section 5) in a set of acquired good practices, weak-
of Open Government Data (OGD) in itself that motivates the paper at nesses to overcome and discriminating factors; in addition, we comple-
hand. These additional costs imputable to low quality data depend on ment these results with a list of technical guidelines in terms of tools,
multiple factors, such as the nature of the data or the type of use and processes and research directions (Section 6) to improve quality of
users (Kim, 2002). Such costs have been reported in the literature main- data whenever a centralized disclosure of data is not possible due to
ly for enterprises, affirming that high quality data are fundamental fac- costs and nature of data. We discuss the limitation of our approach in
tors to a company's success (Madnick, Wang, & Xian, 2004; Haug, Section 7, and provide our roadmap and recommendations for future
Pedersen, & Arlbjørn, 2009; Even & Shankaranarayanan, 2009). This ev- work in Section 8.
idence has been recently reported also for Government Data (Zuiderwijk
& Janssen, 2015), whose disclosure policies have spread worldwide only 2. Related work
after USA 2005 new guidelines on the Freedom of Information Act
or after the presidency of the U.S.A. issued the 2013 Executive Order 2.1. OGD quality problems reported by practitioners
“Making Open and Machine Readable the New Default for Government
Information“. Even Italy, which was one of the early adopters of the Several studies provided examples that contribute to show how is-
“Open by Default” principle, made an explicit policy only 2012 (Decree sues related to poor data quality can be widespread, and potentially
No. 179). As a consequence, our logical argumentations are by construct hamper an efficient reuse of open data.
undetermined by published data on the negative impact of bad OGD Allison (2010) reports problems of accuracy, aggregation and preci-
quality (see, for instance: (Web ref. 3), (Web ref. 4)). Therefore it is sion in Open Government Data, such as bad manual transposition of zip
self evident that when low-quality data are released as open data, codes in public archives. Another example comes from the monthly re-
reuse will be discouraged, and/or several re-users will invest in checking ports of the American Nuclear Regulatory Commission, where spent fuel
and increasing such quality in a decentralized and uncoordinated way: quantities are recorded: data were bouncing both ways from December
understanding poorly documenting datasets and performing data 1982 to May 1983, while the trend for such type of data must have been
cleansing activities represent a significant proportion of the effort neces- only positive (Barlett & Steele, 1985).
sary to reuse Open Data. This represents a waste of resources. Public ad- Aggregation problems were reported on FedSpending.org, that
ministrations are the cheapest cost avoider with respect to the social keeps records on federal contract and grant data: data about companies
waste represented by the poor quality of OGD, and they are supposed that acquired new parent companies were wrongly aggregated, making
to be players with a stronger incentive in investing for the public good: impossible to track the money received after mergers or spin-offs.
increasing the quality of OGD, they could foster reuse and focus the re- Another example of poor aggregation and insufficient details comes
sources of re-users on added value services. However, assessing the from a project by the Sunlight Foundation called Fortune 535, in
quality of OGD is (one of) the necessary preliminary step(s) to motivate which the organization used the personal financial disclosure forms
public administrations to invest on improving OGD quality. that every U.S. Member of Congress is obliged to fill since 1978. Data
This motivates our study, together with the fact that re-users will di- were collected in ranges of income (e.g., from $1 to $1000, and from
rectly benefit from a higher quality of OGD. In the paper at hand, we $1001 and $15,000), but the Congress changed these ranges several
suggest that a widely adopted data quality model for open data, and a times, making it virtually impossible to create consistent time series
set of actionable metrics are necessary tools to achieve data quality (e.g., to analyze which members of Congress accumulated richness
improvement, as it has been recently advocated in the literature during public service).
(Zuiderwijk & Janssen, 2015) (Umbrich, Neumaier, & Polleres, 2015), Other examples concern lack of integration. For instance, Tauberer
contributing to unlock the potential of reuse. Formally, Data Quality is (2012) reports that the two chambers of the U.S. Congress disclosed
defined in the ISO 25012 Standard as “the capability of data to satisfy their data with completely different schema and IDs for Members of
stated and implied needs when used under specified conditions”(ISO Congress, making data integration very difficult: as a consequence,
25012). In addition to this definition, Data Quality is often defined merging or comparing data requires significant extra-effort, which
in the literature as “fitness for use” (Wang & Strong, 1996) (Batini, could be easily avoided by achieving a better coordination between
Cappiello, Francalanci, & Maurino, 2009), i.e., the ability of a data collec- the data holders.
tion to meet users' requirements: thus, “fitness for use” emphasizes the Sunlight's labs director Tom Lee reports data quality problems in a
importance of taking a consumer viewpoint of quality. blog post entitled “How Not to Release Data” (Web ref. 5). The data
In this context, we built an evaluation framework based on the anal- about White House e-mail records was released in form of printouts
ysis of the methodologies for data quality measurement documented in from the record management system (ARMS) and then transmitted
literature, and tailored according to users' needs. Afterwards, we used via fax to Clinton library and re-digitized through OCR. At this point
the resulting evaluation framework to compare the quality of two sam- the document was encoded in PDF and released. The result was badly
ples of OGD: in the first case, data are released in a decentralized way formatted, duplicative and missing information. Moreover, in a recent
(i.e., with no common structure) by different local administrations; in analysis performed by the authors of this paper on open datasets re-
the second, data structures are standardized at governmental level leased by the City of Torino (Italy), problems regarding the absence of
(both cases are samples from Italian Open Government Data). The metadata, not reported measuring system for geographical information
comparison has been driven by the following research question: is the and missing data (Web ref. 6), were identified. Information quality
application of a metric-based evaluation framework for OGD quality able issues can affect also re-users in the academic field: for example
to detect acquired good practices, weak aspects and discriminating factors Whitmore (2014) reported information quality issues (mainly inaccu-
between samples of OGD from two different disclosure strategies? rate and incomplete data) as one of the barriers impairing the prediction
The remainder of the paper is organized as follows. We provide capabilities of open data from US Department of Defense to predict fu-
motivations and comparison of our approach with related work in ture wars.
Section 2. In Section 3 we build the quality framework supported by a Finally, the Open Knowledge Foundation collected examples of ‘bad’
data quality model from the literature, afterwards we set up an initial data provision, mainly from public data holders (Web ref. 7). Issues
A. Vetrò et al. / Government Information Quarterly xxx (2016) xxx–xxx 3
relate with data structuring, formats, and other poorly user-friendly based upon a theoretical framework because dimensions were not
practices. uniquely defined: as a matter of fact, the computed completeness was
actually defined as availability, while accuracy was related to the format
2.2. Existing approaches to OGD quality measurement of the documents instead of their content.
Finally, regarding timeliness, Atz (2014) proposed the “tau metric”,
The issue of open data quality has been partially addressed in recent i.e. a metric that captures the percentage of datasets up-to-date in a
years. In 2006, Tim Berners-Lee published a deployment scheme for data catalogue, and he applies it to three different portals (World
open data, based on five – incremental and progressively demanding Bank, the UK data catalogue and the London data store). The author
– requirements represented as “stars” (Berners-Lee, 2006). A 5-star computed the metric on the datasets and then aggregated to form a sin-
open dataset should comply with all of these requirements: gle indicator of Timeliness, which also discriminates between new re-
lease and minor updates. Results indicated that in two portals only
1. Available on the web, any format provided data has an open license;
about half of the datasets were updated according to their schedule
2. Available as machine-readable structured data (e.g. Excel instead of
and the nature of the contained data, while in the third one only one
image scan);
fourth did. Notwithstanding the different metric constructions, these
3. Available non-proprietary format (e.g. CSV instead of Excel);
findings are similar to those of Maurino et al. (2014).
4. Make use of open standards from W3C (RDF and SPARQL) and URIs
Our work takes inspiration and further motivations from these stud-
to identify things;
ies and tries to overcome their limitations. In particular we aimed to ad-
5. Link data to other providers' data to provide context.
dress the following gaps:
Although widely cited, this schema proposed by Tim-Berners Lee
covers only a specific quality aspect, i.e., the format or encoding used a) We observed that existing studies in literature lack reference to a
to publish the data. As a consequence, a dataset can reach the 5 star comprehensive theoretical framework with univocal definitions of
level while showing at the same time poor quality. For instance, a com- quality characteristics: we propose an evaluation framework build
mon problem is data accuracy: for example typing or syllabling errors on top of a quality model adapted from the literature of data quality
when data input is manual, or associating values to the wrong instances with univocal definitions of quality characteristics.
because of a software misbehavior. In addition, as we have seen in b) We observed that existing works in literature assess the quality of
Section 2.1, the quality of open data does not concern accuracy only, OGD mainly at portal level: we define a set of quantitative indicators,
but also other characteristics: such as completeness, consistency and some of them specific to OGD, for a subset of the available quality
timeliness (Catarci & Scannapieco, 2002). In the literature of data characteristics and at the most granular level of measurement,
quality, several authors contributed to build an extensive list of data which is cell level (according to tabular representation) or dataset
quality characteristics: see for instance Wand and Wang (1996), level when otherwise not possible. We assess the suitability of the
Wang and Strong (1996), Redman (1996), Jarke, Lenzerini, Vassiliou, metrics by using them to compare a sample of OGD from two differ-
and Vassiliadis (1995); Bovee, Srivastava, and Mak (2003), Naumann ent disclosure strategies, to reveal their acquired good practices,
(2002), and Batini & Scannapieco (2006). However one problem we no- weak aspects and discriminating factors.
ticed is that usually different definitions to same quality characteristics
have been given. Another issue we found is that only a few assessments The approach we propose is similar to the one adopted by Behkamal,
have been done so far, and the different definitions of the quality char- Kahani, Bagheri, and Jeremic (2014) for Open Linked Data quality: the
acteristic under study resulted in different metrics used to measure the authors investigated a set of quality characteristics taking as reference
same characteristics, as we are going to report herein. the ISO25012 standard data quality model. The authors build a set of
Ubaldi (2013) proposed an analytical framework to assess of Open 20 metrics related to semantic and syntactic accuracy, uniqueness, com-
Government Data (OGD). The author developed a large set of metrics pleteness and consistency. They verify the suitability of the proposed
at very heterogeneous points of view (e.g., political, organizational, both with a theoretical validation and an empirical one. From the theo-
technical). Data quality is measured in terms of availability (e.g., as retical point of view, all of the metrics respect four out of five desirable
number of datasets and metadata available on a specific portal), de- properties, namely non-negativity, null value, symmetry and monoto-
mand (e.g., number of views per day), and re-use (e.g., number of nicity, but not additivity. However, being additivity a special case of
apps developed with the data). Although this work is probably the monotonicity, the authors state that the satisfaction of the monotonicity
very first comprehensive approach for the assessment of open data, property makes them acceptable for their intended usage. The results of
from the technical point of view it remains quite at high level, because the empirical evaluation lead to the exclusion of four not discriminative
all metrics proposed are at portal level. In addition, no evaluation of metrics (ratio of syntactically incorrect triples, ratio of instances being
the metrics is available. members of disjoint classes, ratio of functional properties with different
A more detailed study was conducted by Maurino, Spahiu, Batini, values, invalid usage of inverse-functional properties), and to the obser-
and Viscusi (2014): the authors analyzed 50 datasets from Italian OGD vation that a dataset with higher number of similar properties is highly
at various administrative levels (regions, provinces, municipalities) in likely to have more triples using these properties, while using similar
terms of completeness, accuracy and timeliness. Completeness is com- properties have an inverse relation with the inconsistency of data values
puted with respect to the availability and accessibility of the document in a dataset.
through internal or external links, accuracy in terms of its format (using With respect to the approach of Behkamal et al. (2014), our work
a 3 level scale instead of the standard de facto 5-stars from Berners-Lee, differentiates in terms of quality model and target data. The authors
2006), while timeliness as the presence or absence of updates. The mea- refer to a data quality model (“SQUARE” — ISO25012), which is a subset
surements proposed by Maurino et al. (2014) are at dataset level, but of the one we considered (SPDQM, i.e. SQUARE Aligned Portal Data
the evaluation is performed at portal level by aggregating the values Quality Model). In addition, their metrics are specific to linked data,
computed on each dataset. The authors observed that about 40% of which are in form of triples while our metrics apply to tabular data:
region and municipality portals were not complete, i.e. did not make Linked Data are only a fraction of the whole disclosed Government
available the data requested by law, against 26% of the province portals. Data. More specifically, in Italy, public datasets in RDF format are 1975
Regarding accuracy, the percentage of documents opened in a not while datasets in csv format (thus, tabular) are 6471 (Web ref. 8).
machine-readable format ranged between 40% and 55%. Similar per- In the following section we report how we select the theoretical
centages were reported for timeliness. While we believe that this eval- framework and we operationalize it with metrics, comparing with the
uation shed a light on data quality problems of Italian OGD, it was not literature work.
3. Open Government Data quality measurement framework The most commonly reported quality problems (see also Section 2.1)
correspond to either intrinsic properties of the data or properties
Our approach to define an open data quality measurement frame- depending on the context or the usage (Bovee et al., 2003). Since OGD
work consists of three parts: typically span heterogeneous domains and they are subject to the
most diverse usage from their consumers, our opinion is that it is prefer-
– identification of the most suitable data quality model as theoretical able to select the dimensions that address the intrinsic aspects of data
support of the measurement framework (Subsection 3.1), quality. In this viewpoint, SPDQM contains the most complete set of
– methodology for the selection of data quality characteristics and characteristics (12) when compared to the other models. In addition,
metrics (Subsection 3.2), SPDQM presents a set of basic characteristics shared by almost all the
– results on the selection of data quality characteristics and metrics models (accuracy, completeness, timeliness), and provides characteris-
(Subsection 3.3). tics as traceability, compliance and understandability, which are less
considered by other frameworks, but are nonetheless important for
3.1. Data quality model selection OGD.
Thus, considering the above, we used the SPDQM as theoretical sup-
There are, available in the literature, several theoretical frameworks, port for our framework.
intended as set of metrics, characteristics and dimensions, for assessing
data quality (DQ), each of these being usually part of a data quality 3.2. Quality characteristics and metric definition: methods
methodology. Such methodologies are typically defined as sequences
of activities that can be divided into three phases (Batini et al., 2009): The next step is the selection of a subset of quality characteristics
from SPDQM and the definition of metrics for those characteristics.
1. State Reconstruction: which collects contextual information about
Concerning the selection of a sub-set of quality characteristics from
the data, like organizational process and services, quality issues and
the full set defined by SPDQM, we used the results from a survey
corresponding costs, data collections and related procedures.
previously conducted by the authors. The survey was conducted in
2. Assessment/measurement: which aimed at measuring the quality of
2013 among the participants of two hackatons aimed at reusing
data with respect to some relevant quality dimensions.
OGD. We collected answers from 15 developers. We focused on two
3. Improvement: which includes all the steps and procedures for
items (out of 14) in the questionnaire used for that survey, the
attaining higher data quality targets.
whole instrument and the measures are reported in Appendix B.1.
Our focus is on phase 2. Batini et al. (2009) collected from the liter- The analyzed item (“Which problems did you find working with
ature an exhaustive list of data quality methodologies, that are reported open data?”, “Which aspects of data quality would you like to im-
in Table 1. prove?”) collects the issues as reported by practitioners. We built a
This list does not contain a relevant model, which is the Square- complete list of the most common issues starting from the answers
Aligned Portal Data Quality Model (SPDQM, Moraga, Moraga, Calero, & and mapped them onto the data quality characteristics of SPDQM.
Caro, 2009), because it was published shortly after the book of Batini This approach, though suffering from a limited generalizability, is far
et al. (2009). The SPDQM was built upon the Portal Data Quality less biased than a selection based solely on the personal believing of
Model (PDQM, Calero, Caro, and Piattini (2008)) and the SQuaRE (ISO/ the research team.
IEC 25012, 2008) standard, both also not contained in the list of Concerning the definition of the metrics, we relied on the principles
Table 1. SPDQM contains a set of 42 characteristics (30 characteristics of Kaiser, Klier, and Heinrich (2007):
from PDQM, 7 characteristics from SQUARE, the remaining 5 character-
istics were added after a systematic literature review), which are orga- – Measurability: the metrics should be normalized and at least interval
nized in two viewpoints and four categories: scaled.
– Interpretability: the metrics have to be comprehensible. Their
– Inherent definition should have right amount of information in order to be
◦ Intrinsic: This denotes that data have quality in their own right interpretable.
– System dependent – Aggregation: it should be possible to quantify data quality at an attri-
◦ Operational: The data must be accessible but secure bute level, as well as tuple, dataset or database level. In this way met-
◦ Contextual: Data Quality must be considered within the context of rics would have a semantic consistency on all the levels. Moreover
the task in hand the metrics should permit value aggregation at a certain level in
o Representational: Data must be interpretable, easy to understand, order to obtain the metric at a higher level.
concisely, and consistently represented. – Feasibility: in order for the metrics to be applicable in a practical
way, they should be based on determinable input parameters and
be preferably automatable.
Table 1
Methodologies considered in Batini et al. (2009).
In addition to these indications, metrics can be classified as either
Methodology acronym Extended name
objective, when they are based on quantitative metrics, or subjective,
TDQM Total data quality management when they are based on qualitative evaluations from users (like in
DWQ The Data Warehouse Quality methodology surveys). In this work we will give more emphasis to quantitative
TIQM Total information quality management
AIMQ A methodology for information quality assessment
measures: quantitative indicators are important for triangulation with
CIHI Canadian Institute for Health Information methodology assessments based on qualitative measures, like questionnaires or
DQA Data quality assessment experts' opinions, which suffer from subjectivism and inconsistency.
IQM Information quality measurement With such a list of desired requirements at hand, we searched the liter-
ISTAT ISTAT methodology
ature for metrics on the selected SPDQM characteristics that satisfy
AMEQ Activity-based Measuring and Evaluating of product
information Quality (AMEQ) methodology them, and when no metric was found, we formulated it ex-novo. Also,
COLDQ Loshin methodology (cost–effect of low data quality) when possible this was done taking as unit of measure the cell of the
DaQuinCIS Data Quality in Cooperative Information Systems dataset, when not, the metrics are at dataset level. We took into account
QAFD Methodology for the quality assessment of financial data both the definition of quality characteristic and, whenever possible, the
CDQ Comprehensive methodology for Data Quality management
type of problem reported by developers.
3.3. Quality characteristics and metric definition: results therefore decentralized data disclosure, with no possibility of extensive
quality controls as in the former case, hence with supposed lower qual-
Table 2 summarizes the type of problems emerging from the survey ity. In fact, considering the workflow leading to the publication of open
and links them to the data quality characteristics of SPDQM. The map- data, a difference in the quality of the published data may be due to a
ping was achieved by comparing the definition of the characteristics difference in the quality of the original datasets or to a difference in
with the issues highlighted by developers. The classification was agreed the publication pipeline. The original dataset object of the study, is in
upon in a meeting involving four of the authors of this work. both cases released by municipalities: therefore, it is safe to assume
Part of the mapping was straightforward: the issues “Incomplete that the quality of data at the source is comparable. However, in the
data”, “Lack of data source traceability”, “Incongruent data”, “Errors” Open Coesione case, while data are aggregated at regional level, they
and “High time to understand data” fit very well to quality characteris- also undergo a quality improvement process (see Section 4.2 for
tics of SPDQM. For the other mappings a few further words have to be details): therefore, we can assume that if any quality difference is mea-
spent. The issue “Out-of-date data” could refer both to time validity sured between the two samples, it is due to the publication pipeline.
and data obsolescence, for this reason in Table 2 refers to both expira- In this way, we are able to understand the suitability and limitations
tion and currentness. Some discussion has to be done for issue “Lack of the metrics in our framework to correctly identify the quality differ-
of Metadata”. Answers to the questionnaire showed that developers ences. We formalize our goal with the Goal Question Metric template
encountered understandability problems. In our theory, this happens (Basili, Caldiera, & Rombach, 1994):
due to poor metadata that do not provide useful guidance. Although
we could not test this cause–effect relationship, we believe that it is Object Open Government Datasets
safe and reasonable to map the code “Lack of Metadata” with Under- Purpose Understand acquired good practices, weak aspects, and discriminating
standability, given that also in the literature metadata are considered factors
Focus Intrinsic data quality
fundamental for the right comprehension of the dataset (see Reiche
Stakeholder Data releaser, data user
and Hofig (2013)). In addition, “Lack of Metadata” is also mapped to Context Open Government Data
compliance due to the existence of a standard for metadata in Open
Government Data sets. Finally, the issue “Format not compliant to well
known standard” had no clear corresponding quality characteristic: Our resulting research question is:
we also mapped it to compliance and we measured it with the compli-
ance to the 5 star Open Data format scheme from Tim Berners-Lee Is the application of a metric-based evaluation framework for OGD
(2006). quality able to detect acquired good practices, weak aspects and dis-
Table 3 contains the metrics we defined for each of the selected qual- criminant factors between samples of OGD from two different disclosure
ity attributes, reporting name and descriptions. The formulas used to strategies?
compute them and the literature references are shown in Appendix A,
while specific tuning of the implementation for the datasets analyzed 4.2. Datasets analyzed
are provided later in Table 5, when the data sources are also presented.
Open Coesione contains data about the fulfillment of the invest-
4. Application of the framework: material and methods ments and related projects by the Italian central government and re-
gions using the 2007–2013 European Cohesion funds. At the time of
4.1. Goals and object of the study this writing, the portal contains data about 900,000 projects, worth
90 billion Euros of financings. Data and information about territorial
Our aim is to use the previously defined evaluation framework to cohesion policies concern projects fundamentals, amount of funding, lo-
quantitatively assess the quality of Open Government Data. To achieve cations, involved subjects and completion times. The information about
our goal, we needed an oracle for OGD of high quality. For this reason the 2007–2013 funds is gathered by a single monitoring system man-
we selected an internationally recognized good example of OGD, aged by the General State Accounting Department (RGS — Ragioneria
i.e. the Open Coesione project, which has been ranked 4th at the 2014 Generale dello Stato) of Italy's Ministry of Economy and Finance. The
Open Government Awards (Web ref. 9). We chose Open Coesione and monitoring system entails a data integration process and a data quality
not the first classified for two reasons: it was the representative project control process to homogenize the information coming from the differ-
for Italy and we had a direct contact with its managers at the Italian ent Italian administrative regions. The data quality control process con-
Ministry General State Accounting Department of Italy's Ministry of cerns mainly the identification codes of the projects and the financial
Economy and Finance. Open Coesione data has been disclosed with a and time variables: checks are done on data completeness and correct-
centralized release strategy with extensive quality controls. We com- ness, and on the consistency between different variables. In particular,
pared Open Coesione with samples of OGD from municipalities and regarding consistency, there are 30 controls, which could determine
the deletion of a project from the dataset, and 14 checks that arise a
Table 2 warning: a sample control on six funding programs and six regions
Problems found in the exploratory survey. resulting in more than 70K projects has generated about 18K drop
outs and about 110K warnings. Most of the problems were related to
Problem found in survey Related quality Intrinsic System-dependent
characteristic inconsistencies of project identifiers in different datasets, missing
data, payments that overcome the available funding, and inconsistent
Incomplete data Completeness X
Format not compliant to Compliance
addresses.
well known standard The specific dataset object of this study contains the projects funded
Lack of data source Traceability X X in the European Social Fund 2007/2013 (Web ref. 10). We used the
traceability snapshot of 31st December 2013, which contains about 55 million
Incongruent data Consistency X
observations of 75 variables (approximately 1 GB size). Variables
Out-of-date data Expiration, currentness X
Lack of Metadata Compliance, concerns: projects identification data; projects thematic classification
understandability according to Italian and European schemas; funding programs; financial
Errors Accuracy X data (funding and payments); dates; control variables used in the qual-
High time to understand Understandability X X ity checks. After discussing with the stakeholders at the RGS, we focused
data
our analysis only on the variables not included in their quality check
Table 3
Metric definitions and description.
Characteristic Metric Level Description
Traceability Track of creation Dataset Indicates the presence or absence of metadata associated with the process of creation of a dataset.
Track of updates Dataset Indicates the existence or absence of metadata associated with the updates done to a dataset.
Indicates the percentage of rows of a dataset that have current values, it means that they don't have any value
Currentness Percentage of current rows Cell
that refers to a previous or a following period of time.
Indicates the ratio between the delay in the publication (number of days passed between the moment in which
Delay in publication Dataset the information is available and the publication of the dataset) and the period of time referred by the dataset
(week, month, year).
Indicates the ratio between the delay in the publication of a dataset after the expiration of its previous version
Expiration Delay after expiration Dataset
and the period of time referred by the dataset (week, month, year).
Indicates the percentage of complete cells in a dataset. It means the cells that are not empty and have a
Completeness Percentage of complete cells Cell
meaningful value assigned (i.e. a value coherent with the domain of the column).
Percentage of complete rows Cell Indicates the percentage of complete rows in a dataset. It means the rows that don't have any incomplete cell.
Percentage of standardized Indicates the percentage of standardized columns in a dataset. It just considers the columns that represent some
Compliance Cell
columns kind of information that has standards associated with it (i.e. geographic information).
Indicates the degree to which a dataset follows the e-GMS standard (as far as the basic elements are concerned, it
eGMS Compliance Dataset
essentially boils down to a specification of which Dublin Core metadata should be supplied)
Five star Open Data Dataset Indicates the level of the 5 star Open Data model in which the dataset is and the advantage offered by this reason.
Percentage of columns Indicates the percentage of columns in a dataset that has associated descriptive metadata. This metadata is
Understandability Cell
with metadata important because it allows to easily understanding the information of the data and the way it is represented.
Percentage of columns in Indicates the percentage of columns in a dataset that is represented in a format that can be easily understood by
Cell
comprehensible format the users and it is also machine-readable.
Indicates the percentage cells in a dataset that has correct values according to the domain and the type of
Accuracy Percentage of accurate cells Cell
information of the dataset.
Indicates the ratio between the error in aggregation and the scale of data representation. This metric only applies
Accuracy in aggregation Cell for the datasets that have aggregation columns or when there are two or more datasets referring to the same
information but in a different granularity level.
procedure. This subset of variables concerns the amounts of financial for example, the delay in publication is not defined when the publica-
support by the different institutions (European Union, Italian Govern- tion date is missing either on the web site or within the metadata. We
ment, Italian local governments) and project dates: in total 22 variables considered ND data equivalent to 0.
(i.e. columns) and about 16 million data-points. Another special case concerns empty cells in the Open Coesione
Concerning the decentralized data disclosure, we needed more than dataset. Empty cells could be considered either as belonging to the do-
one dataset to reduce as much as possible noise and bias in the data main or not: as a matter of fact, an empty cell in column “obtained
sources, however on comparable topics. We decided to select our sam- funding” could be interpreted as a zero because the project didn't ob-
ple from regional administrative municipalities because at regional level tain any funding yet. After a preliminary analysis of the data in conjunc-
the published data were too diverse and not comparable. Therefore we tion with the data releasers, we decided to compute completeness
searched through the single municipality portals for catalogues contain- metrics including the empty cells in the domain of validity, so an
ing data about the same topic (such as resident citizens and hospitals). empty cell still contribute to computation of the metric (see formula
However, even at that level data were very diverse and we ended up in Appendix A).
our search with datasets about three topics: resident citizens, marriages, As we have seen, although the metrics defined are applicable to
and business activities. Table 4 shows which dataset type was found any dataset in tabular form, their implementation might need small
in which city portal. Details and URLs for each dataset are reported in but necessary tailoring to be applied to specific data domains. This is
Appendix B.2. not a limitation of the metrics but rather a peculiarity of OGD, whose
domains vary a lot and data are heterogeneous. We summarize in
4.3. Analysis methodology Table 5 the dataset-specific refinements that were necessary and the
reasons.
Automatic computation of metrics has been possible for complete- As far as the analysis of results is concerned, for each defined metric
ness (percentage of complete cells, percentage of complete rows) and mi, we built a pair of null/alternative hypotheses:
accuracy (percentage of accurate cells, accuracy in aggregation). For all
other quality dimensions and related metrics, manual computation H0: mi, OPENCOESIONE = mi, MUNICIPALITIES
was necessary. All measures have been normalized to the interval HA: mi, OPENCOESIONE ≠ mi, MUNICIPALITIES.
[0,1] (see Appendix A for details).
The null hypothesis is that there is no difference between the Open
In certain cases metrics were not applicable or undefined. A metric is
Coesione datasets and the sample of municipalities' datasets. We per-
considered not available (NA) if it measures a characteristic that does
form the two tailed Mann Whitney test (Sachs (1982)) with standard
not apply to the dataset under study: for example, the percentage of col-
confidence level of 95% to compare the measurements on the two
umns adhering to a standard can be NA if the type of data contained in
groups (for each metric). As a consequence, if p-value is less then
that column is not regulated by standards. While, a metric is considered
0.05, the null hypothesis is rejected in favor of the alternative one.
not defined (ND) when there is not enough information to compute it:
Due to the low number of datasets, we are aware that the test could
be conservative and only looking at the p-value could be misleading in
Table 4 case of small differences. For this reason we report also the confidence
Datasets used with decentralized disclosure. intervals. Additionally, in order to better adhere to our stated goals
Datasets Torino Roma Milano Firenze Bologna (see Section 4.1), we also run a more pragmatic analysis methodology
and classify the metrics into three categories:
Residents X X X X X
Marriages X X X – Acquired good practices, in relation to metrics that in all datasets of
Business X X X
the analyzed regions/municipalities have measures ≥0.50
Table 5
Metrics refinement.
Metric Dataset-specific refinement Reason
Percentage of syntactically The domain of values may vary from attribute to attribute. Every dataset Not all the attributes have the same domain
accurate cells needs to have specified domains for each attribute
Percentage of complete cells If there is no information in the metadata, some assumptions must Default values may be missing values, while for null values it must
be made about the interpretation of null values and (possible) be checked with stakeholder if they are admitted in the domain of
default values (e.g. date of birth 1/1/1900). the attribute (e.g., in the case of optional values).
Percentage of complete rows The same as in “percentage of complete cells” The same as in “percentage of complete cells”
– Quality aspects to be improved, in relation to metrics that in all which could be adopted to improve the quality of OGD on certain qual-
datasets of the analyzed regions/municipalities have measures b0.50 ity characteristics. This is useful for all those cases where setting up a
– Discriminating factors, i.e. metrics whose measures change signifi- data quality system is economically not convenient (e.g., data is too spe-
cantly across the various data sources and determine in which qual- cific, resources are too scarce).
ity aspects a dataset is different from the others.
6.1. Comments on results
The choice of threshold 0.50 is based on the fact that the resulting in-
dicators are polarized (see Fig. 1) and therefore such a threshold is 6.1.1. Understandability
reliable. Metadata at the lower administration level is an aspect that needs to
be improved, because only in two cases (i.e., Torino/business activities
5. Results and Firenze/marriages) we found metadata associated to the raw data,
and in those two cases all the columns have metadata associated. Re-
We report the measurements obtained in Fig. 1. The first three garding comprehensible format, some cities performed very well,
graphs correspond to the municipality-level datasets, the last one to other obtained a metric score very close to zero: this was the case, for
Open Coesione with data aggregated from all regions. Not Available instance, where unit of measures was not specified.
data in municipalities are reported as an empty cell in the corre-
sponding embedded table (and do not contribute to the evaluation), 6.1.2. Time aspects: currentness and expiration
while Not Defined data is equivalent to 0 (and they do contribute to The metrics on the time aspects were higher in Open Coesione,
the evaluation) as discussed in Section 4.3. mainly because data are published at regular intervals of two months.
We report in Table 6 the average of the measurements and the On the contrary, on the municipality level, the assessment shows that
standard deviations. The comparison between Open Coesione and data is generally current, although it is not always released as soon as
the sample of municipality-data is summarized in Table 7 according to it is available. We also observed many cases where expiration of data
the analysis methodology described in Section 4.3: a “+” indicates an was not defined.
acquired good practice, a “−” indicates an aspect to improved, a “±” a
discriminating factor among the regions/municipalities.
6.1.3. Accuracy
We observe that most of the variability resides in understandability.
The data integration and quality checks in Open Coesione have a
Differences are also in completeness (the only aspect in favor of munic-
clear effect on the metric that measure syntactic accuracy. In the munic-
ipalities), accuracy, traceability, currentness and expiration. Finally,
ipalities values were sometimes inconsistent with their domains. For in-
there are equal levels of traceability, compliance and expiration.
stance, the business activity dataset in Torino had the following domain
Last two columns of Table 7 show the p-value of the Mann–Whitney
for column COD_COMP: {CFE, CF, E, EP}; however values in the corre-
test (in bold when statistically significant) and the confidence interval
sponding column were in the following domain: {AE, CF, EP}. Notably
of the difference between the two groups. Statistically significant differ-
the code AE is not present in the metadata schema.
ence is found in the following metrics: Percentage of complete rows,
percentage of syntactically accurate, track of updates, delay in publica-
tion, delay after expiration, eGMS compliance, and percentage of col- 6.1.4. Completeness
umns with metadata. We observe a border value for percentage of This metric is the only one where the centralized dataset performed
columns in comprehensible format. The confidence intervals reveal worst. There are two possible explanations: either the integration of
that except for percentage of complete rows, all statistically significant data with different structures results in empty cells for certain fields,
differences are in favor of the Open Coesione datasets. which are not present in all regional sources, or this is due to the
type of data stored in Open Coesione. By discussing with the stake-
6. Discussion holders it was possible to understand that most of the empty cells
were due to not started projects, so the second explanation is more
The measurements in Fig. 1 and their summary in Table 6 and reasonable.
Table 7 lead us to the conclusion that the centralized data release, at na-
tional level, resulted in better quality than the decentralized one (at mu- 6.1.5. Traceability
nicipality level). Although this is expected due to the presence of a In both cases the dataset creation information is always present,
quality control process (described in Section 4.2), this gives us confi- however there is no information available on data modifications and
dence on the capability of the framework to capture such quality differ- updates. In Open Coesione, in particular, although updates are made
ence. In addition, by applying the specific metrics defined, we are able to every two months, this does not guarantee that in the new release
understand more precisely in what the instances of the two release data from all regions are actually updated. This is the only dimension
strategies differed in terms of data quality. with a negative assessment in both types of datasets.
In light of these measurements, we interpret the obtained results for
the specific quality characteristics in Sub-section 6.1. Then in Sub- 6.1.6. Compliance
section 6.2, by following a line of reasoning by experience and analogy Both samples obtained same positive results in compliance, which
with the state of the art, we reason on possible tools and processes, means that every time a specific standard was present, it was respected.
Fig. 1. Computation of metrics (top three — municipality level, on the bottom — national level).
Moreover, all datasets analyzed are at level three of Tim-Berners Lee datasets out of 12,084 are at level three at the time this article has
scale (i.e., available, structured, in non-proprietary format). This is the been written. The rest of the distribution is the following: none at
most recurrent situation in Italian OGD (Web ref. 11): in fact, 8418 level one, 72 at level two, 2967 at level four, 627 at level five.
Table 6
Metrics computation: descriptive statistics.
Open Coesione (n = 20) Municipalities (n = 11)
Dimension Metric Average Standard deviation Average Standard deviation
Percentage of complete cells 0.94 0.07 0.92 0.15

Completeness
Percentage of complete rows 0.00 0.00 0.74 0.38
Percentage of syntactically accurate cells 1.00 0.00 0.91 0.18
Accuracy
Accuracy in aggregation 1.00 0.00 1.00 0.00
Track of creation 1.00 0.00 1.00 0.00
Traceability
Track of updates 0,50 0.00 0.23 0.08
Percentage of current rows 1.00 0.00 1.00 0.01
Currentness
Delay in publication 0.97 0.01 0.59 0.35
Expiration Delay after expiration 0.99 0.00 0.26 0.45
Percentage of standardized columns 1.00 0.00 1.00 NA
Compliance eGMS compliance 0.92 0.00 0.88 0.01
Five star Open Data 0.60 0.00 0.60 0.00
Percentage of columns with metadata 1.00 0.00 0.03 0.11
Understandability
Percentage of columns in comprehensible format 1.00 0.00 0.83 0.39
6.2. Guidelines for improving poor quality characteristics in OGD same dataset? How to manage (and fund the management of) this
feedback channel? Assuming that a versioning system assists the pro-
6.2.1. Time aspects and traceability cess from the technical point of view: who has the rights to modify
A possible solution to the problems illustrated above consists in ap- data? How is the process of re-validation managed? How much re-
plying versioning systems to open data, so that it is possible to easily ac- sources are required to supervise the users' feedback mechanism?
cess and compare different versions of the same data. Some proposals Alexopoulos, Loukis, and Charalabidis (2014) discuss some of these as-
already exist in the literature. Rufus Pollock, founder and co-Director pects and an open data infrastructure that provides feedback loops
of the Open Knowledge Foundation (OKF), addresses this problem in a with all stakeholders. Kuk and Davies (2011) analyze the value chain
blog post (Web ref. 12) and explores a solution based on well-known of Open Government Data via a multimethod study of the open data
tools. The proposed solution is based on a data pattern made up hackers in the UK.
of two pre-conditions: 1) data must be stored as line-oriented text,
and more specifically as CSV file, and 2) data must be stored on a GIT 6.2.3. Completeness
or Mercurial repository, which offer diff and merge tools. Canova, Completeness of data depends on both the type of data and the
Basso, Iemma, and Morando (2015) compare different approaches for domain. For sure it is of fundamental importance declaring the domain
versioning linked data. Sande, Dimou, Colpaert, Mannens, and Van de of data (see also Understandability). In addition, simple instruments
Walle (2013) have explored a prototypal solution of versioning Linked for data cleaning (e.g., Google Refine (Web ref. 14), Data Cleaner
Data with GIT, while dat project (Web ref. 13) offers a working alfa (Web ref. 15)) can be useful to assess the percentage of empty cells
version for versioning CSV/XML/JSON files. Van de Sompel et al. and understand the causes.
(2010) explore an HTTP-based data versioning model applied to linked
data, using a demonstrator built for DBpedia, also discussing how time- 6.2.4. Understandability and compliance
series analysis across versions of Linked Data descriptions could be The first step to improve data understandability is to provide also
elaborated. metadata. We used the eGMS standard as a reference for the most im-
These proposals rely on the assumption that versioning data important information to embed in metadata. A more comprehensive
proves traceability and encourages collaboration, as happened with checklist is provided on the Opquast website for web quality (Web
open source software development. ref. 16). In addition, the Open Data Foundation provides also useful in-
formation on state of the art and best practices (Web ref. 17).
6.2.2. Accuracy
A feedback mechanism to take note of the errors found by the users 7. Limitations
might be useful, perhaps in conjunction with the versioning system.
However such practice could give rise to several issues, such as: how This study is a first and partial attempt towards objective, reproduc-
to clearly label and distinguish official and unofficial versions of the ible, and scientifically based quality assessment of disclosed government
Table 7
Comparison of the evaluations.
Dimension Metric Open Coesione Municipalities p value Conf. interval
Completeness Percentage of complete cells + + 0.55 {−0.04; 0.04}

Percentage of complete rows − ± b0.05 {−1.00; −0.81}
Accuracy Percentage of syntactically accurate cells + ± b0.05 {−0.001; 0.04}
Accuracy in aggregation + + NaN NaN
Traceability Track of creation + + NaN NaN
Track of updates − − b0.05 {0.25; 0.25}
Currentness Percentage of current rows + + 0.20 {0; 0}
Delay in publication + ± b0.05 {0.08; 0.43}
Expiration Delay after expiration + ± b0.05 {0.99; 0.99}
Compliance Percentage of standardized columns + + NaN NaN
eGMS compliance + + b0.05 {0.04; 0.04}
Five star Open Data + + NaN NaN
Understandability Percentage of columns with metadata + − b0.05 {1.00; 1.00}
Percentage of columns in comprehensible format + ± 0.06 {0; 0}
data. It has some limitations, we discuss here the two that we consider sources and in the presence of a data quality process, and decentralized,
the most important ones. without any type of orchestration and of presumably lower quality,
selecting a set of municipalities' data in the same domains. We observed
7.1. Low generalization of results both common patterns and differences between the two compared re-
lease strategies. The metrics were able to show the benefits of the cen-
Due to time and effort required by manual evaluations of some tralized data disclosure used as example of good quality OGD, as well as
metrics, and due to the difficulty of finding comparable datasets even to quality issues originated from the samples of decentralized data dis-
in a large catalog as the Italian OGD portal, the number of datasets closure that we analyzed. We also provided guidelines and references to
evaluated is small and results cannot generalized. However, finding a improve Open Government Data on specific quality aspects, which
large number of datasets that represent the heterogeneity of the OGD might be valuable for those administrations which are not in the posi-
universe and are still comparable (for instance in terms of domains) tion to systematically apply a data quality process due to the relatively
is a task that we believe being not feasible. For this reason we believe high costs associated, and as useful indications for future research as
that the assessment framework proposed here has still high relevance, well.
because it allows an evidence-based evaluation of Open Government Further ongoing work is devoted to understand whether the prob-
Datasets. Replications from third parties of our analyses on different lems revealed by the metrics are able to predict problems experienced
data sources will increase in the long run the generalizability of the ob- by developers when reusing the data. Finally, future work will focus
tained results. on making the framework also applicable to non-tabular data and to de-
fine metrics for additional intrinsic quality characteristics. For instance
7.2. Selection refinement of the quality characteristics the characteristics and relevant metrics chosen for the framework are
not able to detect redundant and duplicate values in the datasets, or cor-
The selection of the quality characteristics was based on the results rectness of specific data formats. Also, the inability to make assessments
of a survey with a low number of respondents (n = 15). For this reason in terms of some characteristics such as Currentness due to some met-
we do not aim at considering those problems as representative of the rics not being calculable only with the dataset at hand hurts the applica-
most important quality characteristics for OGD from user and releaser bility of the framework and might require modification.
perspectives. However, the results of the survey served us as a basis to The long term goal of this study is to bring a data quality framework
prioritize the characteristics of intrinsic data quality that we wanted to to a level where it can be turned into a tool that automatically assesses
operationalize first with metrics. the quality of a dataset in terms of different characteristics, so that the
negative aspects can be strengthened before the set is released to the
8. Conclusions and future work public.
Current approaches in literature to the problem of OGD quality lack a

comprehensive theoretical framework. In addition, most of the evalua- Acknowledgments
tions concentrate on open data platforms, rather than on datasets. In
this work, we address these two limitations and provided a measure- We would like to sincerely thank Aline Pennisi at Italian Ministry
ment framework to quantitatively assess the quality of OGD. General State Accounting Department (RGS — Ragioneria Generale
We tested the suitability of the framework by applying it on OGD dello Stato) of Italy's Ministry of Economy and Finance. We also thank
samples from two different disclosure strategies: centralized, i.e. moni- Giuseppe Procaccianti from VU University, Amsterdam, for his initial
tored at national level with data aggregated from several regional contribution to this work.
Appendix A. Metrics defined.
Characteristic Metric Variables Formula Scale Normalization Alternative in literature
Traceability Track of creation s: Source tc= 2s + dc [0, 3] tcn ¼ tc3 –

dc: Date of creation
Track of updates lu: List of updates tu = lu + du [0, 2] tun ¼ tu
2
–
du: Dates of updates
Currentness Percentage of ncr: Number of not current rows pcr ¼ ð1 nr Þ
ncr
100 [0%, 100%] pcrn ¼ pcr
100
Several authors gave different
current rows nr: Number of rows. definitions of timeliness and
currency (Heinrich, Klier and
Kaiser, 2009). One of the most
used (adopted by methodologies
DQA, COLDQ, CDQ), is timeliness
defined as: Timeliness = (max
(0; 1 − Currency / Volatility))
(Batini, Cappiello, Francalanci
and Maurino, 2009).
Other references: Heinrich
(2002) & Ballou, Wang, Pazer
and Tayi (1998)
Delay in da: Date of information availability da= ed+ 1 (−∞, 1] dpn = dp –
publication dp: Date of publication
sd: Start date of the period of time dp ¼ 1 ðdpda Þ
edsd
referred by the dataset
ed: End date of the period of time
referred by the dataset.
Expiration Delay after ed: Expiration date dae ¼ 1 ðcded Þ (−∞, +∞) if(dae ≤ 0) –
edsd
expiration cd: Current date daen = 0
sd: Start date of the period of time else if(
Appendix
(continued)
A (continued)
Characteristic Metric Variables Formula Scale Normalization Alternative in literature
referred by the dataset dae ≤ 1)daen = rs

ed: End date of the period of time else if (dae N 1)
referred by the dataset. daen = 1
pcc
Completeness Percentage of nr: Number of rows ncl = nr *nc [0%, 100%] pccn ¼ 100 Completeness with the
complete cells nc: Number of columns pcc ¼ ð1 ncl
ic
Þ 100 “open world” assumption
ic: Number of incomplete cells (i.e., assumption that in
ncl: Number of cells the schema not all the real
world entities are represented)
(Batini & Scannapieco, 2006).
Percentage of nr: Number of rows pcpr ¼ ð1 nir
nr Þ 100
[0%, 100%] pcprn ¼ pcpr
100
–
complete rows nir: Number of incomplete rows
Compliance Percentage of ns: Number of columns with psc ¼ ðnsc
ns
Þ 100 [0%, 100%] psc
pscn ¼ 100 –
standardized associated standards
columns nsc: Number of standardized columns
eGMS s: Source egmsc = s + dc + c + t + [0–5] egmscn ¼ egmsc
5
Interpretability (metric used
compliance dc: Date of creation 0.2(d + id + pb + cv + l) in the Data Warehouse
c: Category Quality — DWQ methodology),
t: Title defined as: “Number of tuples
d: Description (if applicable) with interpretable data,
id: Identifier (if applicable) documentation for key values”
pb: Publisher (if applicable) (Batini et al., 2009; Jeusfeld
cv: Coverage (recommended only) et al., 1998).
l: Language (recommended only)
Five star Open This metric does not [0, 5] fsodn ¼ fsod –
5
Data require any formula;
the value assigned
depends on the level
of the scheme in
which the dataset is.
Understandability Percentage of ncm: Number of column with pcm ¼ ðncm
nc Þ 100 [0, 100] pcmn ¼ pcm
100
–
columns with metadata
metadata nc: Number of columns
Percentage of ncuf: Number of columns in pcuf ¼ ðncuf [0%, 100%] pcufn ¼ pcuf –
nc Þ 100 100
columns in understandable format
comprehensible nc: Number of columns
format
Accuracy Percentage of nce: Number of cells with errors pac ¼ ð1 nce
ncl
Þ 100 pac
[0%, 100%] pacn ¼ 100 Semantic accuracy, in which are
syntactically ncl: Number of cells considered not only the values
accurate cells not belonging to a certain
domain but also all the
values that don't represent
the real world entity correctly,
e.g. incoherent values, and
typos in names (Batini &
Scannapieco, 2006; Heinrich,
2002; Kaiser et al., 2007).
Accuracy in e: Errors sum n
e ¼ ∑i¼1 jdavi oavi j (−∞, 1] if(ea ≤ 0) The metric “derivation integrity”
aggregation s: Scale ea ¼ 1 ðesÞ ean = 0 in the TIMQ framework
oav: Own aggregation value calculates the same thing but in
dav: Dataset aggregation value else if(ea ≤ 0.9) a broader way, it is defined as
ean = 0.25 *ea “percentage of correct
calculations of derived data
else if(ea ≤ 0.95 ) according to the integrity
ean = 0.5 * ea derivation formula or calculation
definition” (Batini et al., 2009;
else if(ean ≤0.999) English, 1999).
ean = 0.75 *ea –
if(ea N 0.999)
ean = ea
Appendix B. Supplementary data Alexopoulos, C., Loukis, A., & Charalabidis, Y. (2014). A platform for closing the open
data feedback loop based on Web2. 0 functionality. Journal of Democracy and &
Open Government, 6(1).
Supplementary data to this article can be found online at http://dx. Allison, B. (2010). My data can't tell you that. In D. Lathrop, & L. Ruma (Eds.), Open
doi.org/10.1016/j.giq.2016.02.001. government — Collaboration, trasparency, and participation in practice (pp. 257–265).
O'Reilly Media, Inc.
Atz, U. (2014). The tau of data: A new metric to assess the timeliness of data in catalogues.
References Proceedings of the International Conference for E-Democracy and Open Government
(CeDEM2014), Krems, Austria.
Aichholzer, G., & Burkert, H. (2004). Public sector information in the digital age: Between Ballou, D. P., Wang, R. Y., Pazer, H., & Tayi, G. K. (1998). Modeling information manufactur-
markets, public management and citizens' rights. Edward Elgar Publishing. ing systems to determine information product quality. Management Science, 44(4).
Barlett, D. L., & Steele, J. B. (1985). Forevermore: Nuclear waste in America. Sharon, Dawes S. (2010). Stewardship and usefulness: Policy principles for information-
Basili, V., Caldiera, G., & Rombach, H. D. (1994). The goal question metric approach. In: based transparency. Government Information Quarterly, 27(4), 377–383.
Encyclopedia of software engineering. John Wiley & Sons, 528–532. Stiglitz, J. E., Orszag, P. R., & Orszag, J. M. (2000). The role of government in a digital age.
Batini, C., & Scannapieco, M. (2006). Data quality, concepts, methodologies and techniques. (Commissioned by the computer\& communications industry association).
Berlin: Springer. Tauberer, J. (2012). Open government data: The book.
Batini, C., Cappiello, C., Francalanci, C., & Maurino A. Methodologies for data quality as- Ubaldi, B. (2013). Open government data: Towards empirical analysis of open govern-
sessment and improvement. ACM Computing Surveys 41, 3, Article 16 (July 2009), ment data initiatives. Tech. rep. OECD Publishing.
(52 pp.). Umbrich, Jürgen, Neumaier, Sebastian, & Polleres, Axel (March 2015). Towards assessing
Behkamal, Behshid, Kahani, Mohsen, Bagheri, Ebrahim, & Jeremic, Zoran (May 2014). the quality evolution of Open Data portals. ODQ2015: Open data quality: From theory
A metrics-driven approach for quality assessment of linked open data. Journal of to practice workshop (Munich, Germany).
Theoretical and Applied Electronic Commerce Research, 9(2), 64–79. Van de Sompel, H., Sanderson, R., Nelson, M. L., Balakireva, L. L., Shankar, H., & Ainsworth,
Berners-Lee, T. (2006). Linked data-design issues. Tech. rep., W3C http://www.w3.org/ S. (2010). An HTTP-based versioning mechanism for linked data. (arXiv preprint arXiv:
DesignIssues/LinkedData.html 1003.3661).
Bovee, M., Srivastava, R., & Mak, B. (2003). Conceptual framework and belief-function Vickery, G. (2011). Review of recent studies on PSI re-use and related market developments.
approach to assessing overall information quality. International Journal of Intelligent Paris: Information Economics.
Systems, 51–74. Wand, Y., & Wang, R. (1996). Anchoring data quality dimensions in ontological founda-
Calero, C., Caro, A., & Piattini, M. (2008). An applicable data quality model for web portal tions. Communications of the ACM, 39, 11.
data consumers. World Wide Web, 11(4), 465–484. Wang, R., & Strong, D. (1996). Beyond accuracy: What data quality means to data con-
Canova, L., Basso, S., Iemma, R., & Morando, F. (2015). Collaborative open data versioning: sumers. Journal of Management Information Systems, 12, 4.
A pragmatic approach using linked data. International Conference for E-Democracy and Whitmore A. Using open government data to predict war: A case study of data and
Open Government 2015 (CeDEM15), Krems, Austria. systems challenges, Government Information Quarterly, Volume 31, Issue 4, October
Catarci, T., & Scannapieco, M. (2002). Data quality under the computer science perspec- 2014, Pages 622-630, (ISSN 0740-624X).
tive. Archivi Computer, 2. Zuiderwijk, A., Janssen, M., & Davis, C. (2014). Innovation with open data: Essential ele-
Detlor, B., Hupfer, Maureen E, Ruhi, U. & Zhao, L. Information quality and community ments of open data ecosystems. Information Policy, 19(1), 17–33.
municipal portal use, Government Information Quarterly, Volume 30, Issue 1, Zuiderwijk, Anneke, & Janssen, Marijn (2015). Participation and data quality in open data
January 2013, Pages 23-32, http://dx.doi.org/10.1016/j.giq.2012.08.00, (ISSN use: Open data infrastructures evaluated. Proceedings of the 15th European Conference
0740-624X). on eGovernment 2015: ECEG 2015. Academic Conferences Limited.
English, L. (1999). Improving data warehouse and business information quality. Wiley &
Sons.
Even, A., & Shankaranarayanan, G. (2009). Utility cost perspectives in data quality man- Web references
agement. The Journal of Computer Information Systems, 50(2), 127–135.
Ferro, E., & Osella, M. (2013). Eight business model archetypes for PSI re-use. “Open data Web ref. 1. http://opendefinition.org/(last visited on November 5, 2015).
on the web” workshop, Google campus, London. Web ref. 2. http://census.okfn.org/, (last visited on November 5, 2015).
Haug, A., Pedersen, A., & Arlbjørn, J. S. (2009). A classification model of ERP system data Web ref. 3. http://goo.gl/QlUT1d (last visited on November 5, 2015).
quality. Industrial Management & Data Systems, 109(8), 1053–1068. http://dx.doi. Web ref. 4. https://goo.gl/yq4ElP (last visited on November 5, 2015).
org/10.1108/02635570910991292. Web ref. 5. http://sunlightfoundation.com/blog/2010/06/23/elenas-inbox/ (last visit on
Heinrich, B. (2002). Datenqualitätsmanagement in Data Warehouse-Systemen. (doctoral November 5, 2015).
thesis, Oldenburg). Web ref. 6. http://nexa.polito.it/lunch-9 (last visit on November 5, 2015).
Heinrich, B., Klier, M., & Kaiser, M. (2009). A procedure to develop metrics for cur- Web ref. 7. http://okfnlabs.org/bad-data/ (last visit on November 5, 2015).
rency and its application in CRM. Journal of Data and Information Quality, 1(1), Web ref. 8. http://www.dati.gov.it/dataset (last visit on November 5, 2015).
5. Web ref. 9. https://www.opengovawards.org/Awards_Booklet_Final.pdf, (last visit on
Helbig, N., Nakashima, M., & Dawe, Sharon S. (June 4–7, 2012). Understanding the value November 5, 2015).
and limits of government information in policy informatics: A preliminary explora- Web ref. 10. http://en.wikipedia.org/wiki/European_Social_Fund, (last visit on November
tion. Proceedings of the 13th Annual International Conference on Digital Government 5, 2015).
Research (dg.o2012). Web ref. 11. http://www.dati.gov.it/(last visit on November 5, 2015).
Hofmokl, J. (2010). The Internet commons: toward an eclectic theoretical framework. Web ref. 12. http://blog.okfn.org/2013/07/02/git-and-github-for-data/ (last visited on
International Journal of the Commons, 4(1), 226–250. November 5, 2015).
Iemma, R., Morando, F., & Osella, M. (2014). Breaking public administrations' data silos. Web ref. 13. http://dat-data.com/(last visited on November 5, 2015).
eJournal of eDemocracy & Open Government, 6(2). Web ref. 14. https://code.google.com/p/google-refine/ (last visited on November 5, 2015).
Janssen, M., Charalabidis, Y., & Zuiderwijk, A. (2012). Benefits, adoption barriers and Web ref. 15. http://datacleaner.org/, (last visited on November 5, 2015).
myths of open data and open government. Information Systems Management, 29(4), Web ref. 16. http://checklists.opquast.com/en/opendata, (last visited on November 5,
258–268. 2015).
Vassiliou, Y. (1995). In M. Jarke, M. Lenzerini, & P. Vassiliadis (Eds.), Fundamentals of data Web ref. 17. http://odaf.org/papers/Open%20Data%20and%20Metadata%20Standards.pdf,
warehouses. Springer Verlag. (last visited on November 5, 2015).
Jeusfeld, M., Quix, C., & Jarke, M. (1998). Design and analysis of quality information for
datawarehouses. Proceedings of the 17th International Conference on Conceptual
Modeling.
Kaiser, M., Klier, M., & Heinrich, B. (2007). “How to measure data quality? — A metric-
Antonio Vetrò is Director of Research and Policy at the Nexa
based approach” (2007). ICIS 2007 Proceedings. Paper 108.
Center for Internet and Society at Politecnico di Torino
Kim, W. (2002). On three major holes in data warehousing today. Journal of Object
(Italy). Formerly, he has been a research fellow in the Soft-
Technology, 1(4), 39–47. http://dx.doi.org/10.5381/jot.2002.1.4.c3.
ware and System Engineering Department at Technische
Kuk, G., & Davies, T. (2011). The roles of agency and artifacts in assembling open data
Universität München (Germany) and junior scientist at
complementarities. Thirty Second International Conference on Information Systems.
Fraunhofer Center for Experimental Software Engineering
Madnick, S., Wang, R., & Xian, X. (2004). The design and implementation of a corporate
(MD, U.S.A.). He holds a PhD in Information and System En-
householding knowledge processor to improve data quality. Journal of Management
gineering from Politecnico di Torino (Italy). He is interested
Information Systems, 20(1), 41–49.
in studying the impact of technology on society, with a focus
Maurino, A., Spahiu, B., Batini, C., & Viscusi, G. (2014). Compliance with Open Government
on technology transfer and internet science, and adopting an
Data Policies: an empirical evaluation of Italian local public administrations. Twenty
empirical epistemological approach. Contact him at , antonio.
Second European Conference on Information Systems, Tel Aviv.
[email protected].
Mayer-Schönberger, V., & Zappia, Z. (2011). Participation and power: intermediaries of
open data. 1st Berlin Symposium on Internet and Society.
Moraga, C., Moraga, M., Calero, C., & Caro, A. (2009). SQuaRE-aligned data quality model
for web portals. Quality Software, 2009. QSIC’09. 9th International Conference on
Lorenzo Canova holds a MSc in Industrial engineering from
(pp. 117–122).
Politecnico di Torino and wrote his Master thesis on Open
Naumann, F. (2002). Quality-driven query answering for integrated information systems.
Government Data quality. He is currently a research fellow
Lecture Notes in Computer Science, 2261.
of the Nexa Center for Internet & Society at Politecnico di To-
Redman, T. (1996). Data quality for the information age. Artech House.
rino. His main research interests are in Open Government
Reiche, K., & Hofig, E. (2013). Implementation of metadata quality metrics and application
Data, government transparency and linked data. Contact
on public government data. Computer software and applications conference workshops
him at , [email protected].
(COMPSACW), 2013 IEEE 37th annual (pp. 236–241).
Sachs, L. (1982). Applied statistics. A handbook of techniques. New York — Heidelberg —
Berlin: Springer-Verlag (734 pp., 59 figs., DM 118,-).
Sande, M. V., Dimou, A., Colpaert, P., Mannens, E., & Van de Walle, R. (2013). Linked data
as enabler for open data ecosystems. Open data on the web 23–24 April 2013, Campus
London, Shoreditch.
Marco Torchiano is an associate professor at Politecnico di Raimondo Iemma holds a MSc in Industrial engineering
Torino, Italy; he has been a post-doctoral research fellow at from Politecnico di Torino (Italy). His research focuses on
Norwegian University of Science and Technology (NTNU), open data platforms, smart disclosure of consumption data,
Norway. He received an MSc and a PhD in Computer Engi- economics of information, and, in general, open government
neering from Politecnico di Torino, Italy. He is an author or and innovation within (or fostered by) the public sector. Be-
coauthor of more than 100 research papers published in in- tween 2013 and 2015, he served as a Managing Director & re-
ternational journals and conferences. He is the co-author of search fellow at the Nexa Center for Internet & Society at
the book ‘Software Development—Case studies in Java’ from Politecnico di Torino. He works as a project manager at
Addison-Wesley, and the co-editor of the book ‘Developing Seac02, a digital company based in Torino. Contact him at ,
Services for the Wireless Internet’ from Springer. His current [email protected].
research interests are: design notations, testing methodolo-
gies, OTS-based development, and static analysis. The meth-
odological approach he adopts is that of empirical software
engineering. Contact him at , [email protected].
Federico Morando (M.A.S. in Economic theory and econo-

metrics from the Univ. of Toulouse, Ph.D. in Institutions,
Camilo O. Minotas holds a MSc from Politecnico di Torino Economics and Law from the Univ. of Turin and Ghent) is
and wrote his Master thesis on Open Data Quality Assess- an economist, with interdisciplinary research interests
ment. He is currently a software developer in Colombia. His focused on the intersection between law, economics and
main research interests are in open data quality and govern- technology. He taught intellectual property and competition
ment transparency. Contact him at , [email protected]. law at Bocconi University in Milan, and he lectures at
the Politecnico di Torino, at the WIPO LL.M. in Intellectual
Property and in other post-graduate and doctoral courses.
Between 2009 and 2015, he served as a Managing Director
and then Director of Research and Policy at the Nexa Center
for Internet & Society at Politecnico di Torino. He is currently
leading a startup company that focused on linked data.
Contact him at , [email protected].

10 1016j Giq 2016 02 001

Uploaded by

Copyright:

Available Formats

10 1016j Giq 2016 02 001

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

10 1016j Giq 2016 02 001

Uploaded by

Copyright:

Available Formats

GOVINF-01153; No.

of pages: 13; 4C:

Contents lists available at ScienceDirect

Government Information Quarterly

journal homepage: www.elsevier.com/locate/govinf

Open data quality measurement framework: Deﬁnition and application

1. Introduction actors, ranging from companies to Non-Governmental Organizations,

Characteristic Metric Level Description

Metric Dataset-speciﬁc reﬁnement Reason

Open Coesione (n = 20) Municipalities (n = 11)

Dimension Metric Average Standard deviation Average Standard deviation

Percentage of complete cells 0.94 0.07 0.92 0.15

Dimension Metric Open Coesione Municipalities p value Conf. interval

Completeness Percentage of complete cells + + 0.55 {−0.04; 0.04}

Current approaches in literature to the problem of OGD quality lack a

Appendix A. Metrics deﬁned.

Characteristic Metric Variables Formula Scale Normalization Alternative in literature

Traceability Track of creation s: Source tc= 2s + dc [0, 3] tcn ¼ tc3 –

Characteristic Metric Variables Formula Scale Normalization Alternative in literature

referred by the dataset dae ≤ 1)daen = rs

Federico Morando (M.A.S. in Economic theory and econo-

You might also like