International Journal of Data Science and Analytics (2021) 11:263–278


Data science: a game changer for science and innovation

Valerio Grossi1 · Fosca Giannotti1 · Dino Pedreschi2 · Paolo Manghi3 · Pasquale Pagano3 · Massimiliano Assante3

Received: 13 July 2019 / Accepted: 15 December 2020 / Published online: 19 April 2021
© The Author(s) 2021

This paper shows data science’s potential for disruptive innovation in science, industry, policy, and people’s lives. We present
how data science impacts science and society at large in the coming years, including ethical problems in managing human
behavior data and considering the quantitative expectations of data science economic impact. We introduce concepts such
as open science and e-infrastructure as useful tools for supporting ethical data science and training new generations of data
scientists. Finally, this work outlines SoBigData Research Infrastructure as an easy-to-access platform for executing complex
data science processes. The services proposed by SoBigData are aimed at using data science to understand the complexity of
our contemporary, globally interconnected society.

Keywords Responsible data science · Research infrastructure · Social mining

1 Introduction: from data to knowledge

Data science is an interdisciplinary and pervasive paradigm where different theories and models are combined to transform data into knowledge (and value). Experiments and analyses over massive datasets are functional not only to the validation of existing theories and models but also to the data-driven discovery of patterns emerging from data, which can help scientists in the design of better theories and models, yielding a deeper understanding of the complexity of social, economic, biological, technological, cultural, and natural phenomena. The products of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. All these aspects are producing a change in the scientific method, in research, and in the way our society makes decisions [2].

Data science emerges from three concurring facts: (i) the advent of big data that provides the critical mass of actual examples to learn from, (ii) the advances in data analysis and learning techniques that can produce predictive models and behavioral patterns from big data, and (iii) the advances in high-performance computing infrastructures that make it possible to ingest and manage big data and perform complex analysis [16].

Paper organization Section 2 discusses how data science will impact science and society at large in the coming years. Section 3 outlines the main ethical issues that data science introduces in studying human behaviors. In Sect. 4, we show how concepts such as open science and e-infrastructure are effective tools for supporting and disseminating ethical uses of data and for training new generations of data scientists. We will illustrate the importance of an open data science with examples provided later in the paper. Finally, we show some use cases of data science through thematic environments that bind the datasets with social mining methods.

2 Data science for society, science, industry and business

The quality of business decision making, government [...] improved by analyzing data. Data science offers important [...] with remarkable accuracy and timeliness.

1 CNR - Istituto Scienza e Tecnologia dell'Informazione A. Faedo, KDDLab, Pisa, Italy
2 Department of Computer Science, University of Pisa, Pisa, Italy
3 CNR - Istituto Scienza e Tecnologia dell'Informazione A. Faedo, NeMIS, Pisa, Italy

Fig. 1 Data science as an ecosystem: on the left, the figure shows the main components enabling data science (data, analytical methods, and infrastructures). On the right, we can find the impact of data science into society, science, and business. All the activities related to data science should be done under rigid ethical principles

As shown in Fig. 1, data science is an ecosystem where the following scientific, technological, and socioeconomic factors interact:

– Data: Availability of data and access to data sources;
– Analytics & computing infrastructures: Availability of high performance analytical processing and open-source analytics;
– Skills: Availability of highly and rightly skilled data scientists and engineers;
– Ethical & legal aspects: Availability of regulatory environments for data ownership and usage, data protection and privacy, security, liability, cybercrime, and intellectual property rights;
– Applications: Business and market ready applications;
– Social aspects: Focus on major societal global challenges.

Data science, envisioned as the intersection between data mining, big data analytics, artificial intelligence, statistical modeling, and complex systems, is capable of monitoring data quality and the results of analytical processes transparently.

If we want data science to face the global challenges and become a determinant factor of sustainable development, it is necessary to push towards an open global ecosystem for scientific, industrial, and societal innovation [48]. We need to build an ecosystem of socioeconomic activities, where each new idea, product, and service creates opportunities for further purposes and products. An open data strategy, innovation, interoperability, and suitable intellectual property rights can catalyze such an ecosystem and boost economic growth and sustainable development. This strategy also requires “networked thinking” and a participatory, inclusive approach.

Data are relevant in almost all the scientific disciplines, and a data-dominated science could lead to the solution of problems currently considered hard or impossible to tackle. It is impossible to cover all the scientific sectors where a data-driven revolution is ongoing; here, we provide just a few examples.

The Sloan Digital Sky Survey has become a central resource for astronomers over the world. Astronomy is being

Fig. 2 The data science pipeline starts with raw data and transforms them into data used for analytics. The next step is to transform these data into
knowledge through analytical methods and then provide results and evaluation measures
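
The stages named in the Fig. 2 caption can be illustrated with a minimal Python sketch. It is not taken from the paper: the records, the function names (`prepare`, `analyze`, `evaluate`), the salt, and the choice of a coverage ratio as the evaluation measure are all assumptions made for illustration. Note also that a salted hash only pseudonymizes identifiers and does not by itself anonymize data.

```python
import hashlib
from statistics import mean

# Hypothetical raw records: (user id, item price as text). Illustrative values only.
RAW = [("alice", "12.5"), ("bob", "n/a"), ("alice", "7.5"), ("carol", "20.0")]

def prepare(raw, salt="demo-salt"):
    """Raw data -> analytics-ready data: drop unparsable rows, pseudonymize ids."""
    rows = []
    for user, price in raw:
        try:
            value = float(price)
        except ValueError:
            continue  # skip records whose price cannot be parsed
        # Salted hash as a stand-in for a real pseudonymization scheme (assumption).
        pseudo = hashlib.sha256((salt + user).encode()).hexdigest()[:8]
        rows.append((pseudo, value))
    return rows

def analyze(rows):
    """Analytics-ready data -> knowledge: average spend per pseudonymous user."""
    per_user = {}
    for pseudo, value in rows:
        per_user.setdefault(pseudo, []).append(value)
    return {pseudo: mean(values) for pseudo, values in per_user.items()}

def evaluate(raw, rows):
    """A simple evaluation measure: share of raw records that survived preparation."""
    return len(rows) / len(raw)

rows = prepare(RAW)
knowledge = analyze(rows)
coverage = evaluate(RAW, rows)
```

Keeping each stage as a separate function mirrors the figure: every arrow in the pipeline becomes a point where data quality, and the ethical checks discussed in Sect. 3, can be inspected.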

transformed from the one where taking pictures of the sky was a large part of an astronomer’s job, to the one where the images are already in a database, and the astronomer’s task is to find interesting objects and phenomena in the database. In biological sciences, data are stored in public repositories. There is an entire discipline of bioinformatics that is devoted to the analysis of such data. Data-centric approaches based on personal behaviors can also support medical applications analyzing data at both human behavior levels and lower molecular ones. For example, integrating genome data of medical reactions with the habits of the users enables a computational drug science for high-precision personalized medicine. In humans, as in other organisms, most cellular components exert their functions through interactions with other cellular components. The totality of these interactions (representing the human “interactome”) is a network with hundreds of thousands of nodes and a much larger number of links. A disease is rarely a consequence of an abnormality in a single gene. Instead, the disease phenotype is a reflection of various pathological processes that interact in a complex network. Network-based approaches can have multiple biological and clinical applications, especially in revealing the mechanisms behind complex diseases [6].

Now, we illustrate the typical data science pipeline [50]. People, machines, systems, factories, organizations, communities, and societies produce data. Data are collected in every aspect of our life, when: we submit a tax declaration; a customer orders an item online; a social media user posts a comment; an X-ray machine is used to take a picture; a traveler sends a review on a restaurant; a sensor in a supply chain sends an alert; or a scientist conducts an experiment. This huge and heterogeneous quantity of data needs to be extracted, loaded, understood, transformed, and in many cases, anonymized before they may be used for analysis. Analysis results include routines, automated decisions, predictions, and recommendations, and outcomes that need to be interpreted to produce actions and feedback. Furthermore, this scenario must also consider ethical problems in managing social data. Figure 2 depicts the data science pipeline.3 Ethical aspects are important in the application of data science in several sectors, and they are addressed in Sect. 3.

3 Responsible Data Science program:

2.1 Impact on society

Data science is an opportunity for improving our society and boosting social progress. It can support policymaking; it offers novel ways to produce high-quality and high-precision statistical information and empower citizens with self-awareness tools. Furthermore, it can help to promote ethical uses of big data.

Modern cities are perfect environments densely traversed by large data flows. Using traffic monitoring systems, environmental sensors, GPS individual traces, and social information, we can organize cities as a collective sharing of resources that need to be optimized, continuously monitored, and promptly adjusted when needed. It is easy to understand the potentiality of data science by introducing terms such as urban planning, public transportation, reduction of energy consumption, ecological sustainability, safety, and management of mass events. These terms represent only the front line of topics that can benefit from the awareness that big data might provide to the city stakeholders [22,27,29]. Several methods allowing human mobility analysis and prediction are available in the literature: MyWay [47] exploits individual systematic behaviors to predict future human movements by combining individual and collective learned models. Carpooling [22] is based on mobility data from travelers in a given territory and constructs a network of potential carpooling users, by exploiting topological properties, highlighting sub-populations with higher chances to create a carpooling community and the propensity of users to be either drivers or passengers in a shared car. Event attendance prediction [13] analyzes users’ call habits and classifies people into behavioral categories, dividing them among residents, commuters, and visitors; it makes it possible to observe the variety of behaviors of city users and the attendance in big events in cities.

Electric mobility is expected to gain importance for the world. The impact of a complete switch to electric mobility is still under investigation, and what appears to be critical is the intensity of flows due to charge (and fast recharge) systems that may challenge the stability of the power network. To avoid instabilities regarding the charging infrastructure, an accurate prediction of power flows associated with mobility is needed. The use of personal mobility data can estimate the mobility flow and simulate the impact of different charging behavioral patterns to predict power flows and optimize the position of the charging infrastructures [25,49]. Lorini et al. [26] present an example of urban flood prediction that integrates data provided by the CEM system and Twitter data. Twitter data are processed using massive multilingual approaches for classification. The model is a supervised model which requires a careful data collection and validation of ground truth about confirmed floods from multiple sources.

Another example of data science for society can be found in the development of applications with functions aimed directly at the individual. In this context, concepts such as personal data stores and personal data analytics are aimed at implementing a new deal on personal data, providing a user-centric view where data are collected, integrated and analyzed at the individual level, and providing the user with better awareness of own behavioral, health, and consumer profiles. Within this user-centric perspective, there is room for an even broader market of business applications, such as high-precision real-time targeted marketing, e.g., self-organizing decision making to preserve desired global properties, and sustainability of the transportation or the healthcare system. Such contexts emphasize two essential aspects of data science: the need for creativeness to exploit and combine the several data sources in novel ways and the need to give awareness and control of the personal data to the users that generate them, to sustain a transparent, trust-based, crowd-sourced data ecosystem [19].

The impact of online social networks in our society has changed the mechanisms behind information spreading and news production. The transformation of media ecosystems and news consumption is having consequences in several fields. A relevant example is the impact of misinformation on society, as for the Brexit referendum, when the massive diffusion of fake news has been considered one of the most relevant factors of the outcome of this political event. Examples of achievements are provided by the results regarding the influence of external news media on polarization in online social networks. These achievements indicate that users are highly polarized towards news sources, i.e., they cite (and tend to cite) sources that they identify as ideologically similar to them. Other results regard echo chambers and the role of social media users: there is a strong correlation between the orientation of the content produced and consumed. In other words, an opinion “echoes” back to the user when others are sharing it in the “chamber” (i.e., the social network around the user) [36]. Other results worth mentioning regard efforts devoted to uncovering spam and bot activities in stock microblogs on Twitter: taking inspiration from biological DNA, the idea is to model the online users’ behavior through strings of characters representing sequences of online users’ actions. The papers [11,12] report that 71% of suspicious users were classified as bots; furthermore, 37% of them also got suspended by Twitter a few months after our investigation. Several approaches can be found in the literature. However, they generally display some limitations. Some of them work only on some of the features of the diffusion of misinformation (bot detections, segregation of users due to their opinions or other social analysis), or there is a lack of comprehensive frameworks for interpreting results. While the former case is somehow due to the innovation of the research field and it is explainable, the latter showcases a more fundamental need, as, without strict statistical validation, it is hard to state which are the crucial elements that permit a well-grounded description of a system. To counter the diffusion of fake news, we can state that building a comprehensive fake news dataset providing all information about publishers, shared contents, and the engagements of users over space and time, together with their profile stories, can help the development of innovative and effective learning models. Both unsupervised and supervised methods will work together to identify misleading information. Multidisciplinary teams made up of journalists, linguists, behavioral scientists, and similar will be needed to identify what amounts to information warfare campaigns. Cyberwarfare and information warfare will be two of the biggest threats the world will face in the 21st century.

Social sensing methods collect data produced by digital citizens, by either opportunistic or participatory crowd-sensing, depending on users’ awareness of their involvement. These approaches present a variety of technological and ethical challenges. An example is represented by Twitter Monitor [10], a crowd-sensing tool designed to access Twitter streams through the Twitter Streaming API. It allows launching parallel listening for collecting different sets of data. Twitter Monitor represents a tool for creating services for listening campaigns regarding relevant events such as political elections, natural and human-made disasters, popular national events, etc. [11]. Such a campaign can be carried out by specifying keywords, accounts, and geographical areas of interest.

Nowcasting5 financial and economic indicators focus on the potential of data science as a proxy for well-being and socioeconomic applications. The development of innovative research methods has demonstrated that poverty indicators can be approximated by social and behavioral mobility metrics extracted from mobile phone data and GPS data [34]; and the Gross Domestic Product can be accurately nowcasted by using retail supermarket market data [18]. Furthermore, nowcasting of demographic aspects of territory based on Twitter data [1] can support official statistics, through the estimation of location, occupation, and semantics. Networks are a convenient way to represent the complex interaction among the elements of a large system. In economics, networks are gaining increasing attention because the underlying topology of a networked system affects the aggregate output, the propagation of shocks, or financial distress; or the topology allows us to learn something about a node by looking at the properties of its neighbors. Among the most investigated financial and economic networks, we cite a work that analyzes the interbank systems, the payment networks between firms, the banks-firms bipartite networks, and the trading network between investors [37]. Another interesting phenomenon is the advent of blockchain technology that has led to the innovation of bitcoin crypto-currency [31].

5 Nowcasting in economics is the prediction of the present, the very near future, and the very recent past state of an economic indicator.

Data science is an excellent opportunity for policy, data journalism, and marketing. The online media arena is now available as a real-time experimenting society for understanding social mechanisms, like harassment, discrimination, hate, and fake news. In our vision, the use of data science approaches is necessary for better governance. These new approaches integrate and change the Official Statistics, representing a cheaper and more timely manner of computing them. The impact of data science-driven applications can be particularly significant when the applications help to build new infrastructures or new services for the population.

The availability of massive data portraying soccer performance has facilitated recent advances in soccer analytics. Rossi et al. [42] proposed an innovative machine learning approach to the forecasting of non-contact injuries for professional soccer players. In [3], we can find the definition of quantitative measures of pressing in defensive phases in soccer. Pappalardo et al. [33] outlined the automatic and data-driven evaluation of performance in soccer, a ranking system for soccer teams. Sports data science is attracting much interest and is now leading to the release of a large and public dataset of sports events.

Finally, data science has unveiled a shift from population statistics to interlinked entities statistics, connected by mutual interactions. This change of perspective reveals universal patterns underlying complex social, economic, technological, and biological systems. It is helpful to understand the dynamics of how opinions, epidemics, or innovations spread in our society, as well as the mechanisms behind complex systemic diseases, such as cancer and metabolic disorders, revealing hidden relationships between them. Considering diffusive models and dynamic networks, NDlib [40] is a Python package for the description, simulation, and observation of diffusion processes in complex networks. It collects diffusive models from epidemics and opinion dynamics and allows a scientist to compare simulation over synthetic systems. For community discovery, two tools are available for studying the structure of a community and understanding its habits: Demon [9] extracts ego networks (i.e., the set of nodes connected to an ego node) and identifies the real communities by adopting a democratic, bottom-up merging approach of such structures. Tiles [41] is dedicated to dynamic network data and extracts overlapping communities and tracks their evolution in time following an online iterative procedure.

2.2 Impact on industry and business

Data science can create an ecosystem of novel data-driven business opportunities. As a general trend across all sectors, massive quantities of data will be made accessible to everybody, allowing entrepreneurs to recognize and to rank shortcomings in business processes, to spot potential threats and win-win situations. Ideally, every citizen could derive new business ideas from these patterns. Co-creation enables data scientists to design innovative products and services.
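
As a hedged illustration of the diffusion processes mentioned at the end of Sect. 2.1, the sketch below simulates a basic SIR (susceptible, infected, recovered) epidemic on a small synthetic network using only the Python standard library. It does not use NDlib's API; the function names, the ring-shaped graph, and the parameter values (`beta`, `gamma`, the random seed) are assumptions chosen for illustration.

```python
import random

def ring_graph(n, k=2):
    """Synthetic network: each node is linked to its k nearest neighbors on each side."""
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            graph[i].add(j)
            graph[j].add(i)
    return graph

def sir_step(graph, status, beta, gamma, rng):
    """One synchronous SIR update; status maps node -> 'S', 'I' or 'R'."""
    new_status = dict(status)
    for node, state in status.items():
        if state != "I":
            continue
        # An infected node infects each susceptible neighbor with probability beta.
        for nb in graph[node]:
            if status[nb] == "S" and rng.random() < beta:
                new_status[nb] = "I"
        # It then recovers with probability gamma.
        if rng.random() < gamma:
            new_status[node] = "R"
    return new_status

def simulate(graph, seeds, beta=0.3, gamma=0.1, steps=50, seed=42):
    """Run the model and record the number of S, I and R nodes before each step."""
    rng = random.Random(seed)
    status = {n: ("I" if n in seeds else "S") for n in graph}
    history = []
    for _ in range(steps):
        counts = {"S": 0, "I": 0, "R": 0}
        for state in status.values():
            counts[state] += 1
        history.append(counts)
        status = sir_step(graph, status, beta, gamma, rng)
    return history

history = simulate(ring_graph(30), seeds={0})
```

Comparing `history` across different synthetic graphs (for instance, varying `k`) is the kind of experiment the text attributes to diffusion libraries: the same model, observed over different network structures.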

The value of joining different datasets is much larger than the sum of the value of the separated datasets by sharing data of various nature and provenance.

The gains from data science are expected across all sectors, from industry and production to services and retail. In this context, we cite several macro-areas where data science applications are especially promising. In energy and environment, the digitization of the energy systems (from production to distribution) enables the acquisition of real-time, high-resolution data. Coupled with other data sources, such as weather data, usage patterns, and market data (accompanied by advanced analytics), efficiency levels can be increased immensely. The positive impact on the environment is also enhanced by geospatial data that help to understand how our planet and its climate are changing and to confront major issues such as global warming, preservation of the species, and the role and effects of human activities.

The manufacturing and production sector, with the growing investments into Industry 4.0 and smart factories with sensor-equipped machinery that is both intelligent and networked (see Internet of Things and cyber-physical systems), will be one of the major producers of data in the world. The application of data science in this sector will bring efficiency gains and predictive maintenance. Entirely new business models are expected since the mass production of individualized products becomes possible where consumers may have direct access to influence and control.

As already stated in Sect. 2.1, data science will contribute to increasing efficiency in public administration processes and healthcare. In the physical and the cyber-domain, security will be enhanced. From financial fraud to public security, data science will contribute to establishing a framework that enables a safe and secure digital economy. Big data exploitation will open up opportunities for innovative, self-organizing ways of managing logistical business processes. Deliveries could be based on predictive monitoring, using data from stores, semantic product memories, internet forums, and weather forecasts, leading to both economic and environmental savings. Let us also consider the impact of personalized services for creating real experiences for tourists. The analysis of real-time and context-aware data (with the help of historical and cultural heritage data) will provide customized information to each tourist, and it will contribute to the better and more efficient management of the whole tourism value chain.

3 Data science ethics

Data science creates great opportunities but also new risks. The use of advanced tools for data analysis could expose sensitive knowledge of individual persons and could invade their privacy. Data science approaches require access to digital records of personal activities that contain potentially sensitive information. Personal information can be used to discriminate against people based on their presumed characteristics. Data-driven algorithms yield classification and prediction models of behavioral traits of individuals, such as credit score, insurance risk, health status, personal preferences, and religious, ethnic, or political orientation, based on personal data disseminated in the digital environment by users (with or often without their awareness). The achievements of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. For example, mobile phone call records are initially collected by telecom operators for billing and operational aims, but they can be used for accurate and timely demography and human mobility analysis at a country or regional scale. This re-purposing of data clearly shows the importance of legal compliance and data ethics technologies and safeguards to protect privacy and anonymity; to secure data; to engage users; to avoid discrimination and misuse; to account for transparency; and to seize the opportunities of data science while controlling the associated risks.

Several aspects should be considered to avoid harming individual privacy. Ethical elements should include: (i) monitoring the compliance of experiments, research protocols, and applications with ethical and juridical standards; (ii) developing big data analytics and social mining tools with value-sensitive design and privacy-by-design methodologies; (iii) boosting the excellence and international competitiveness of Europe’s big data research in the safe and fair use of big data for research. It is essential to highlight that data scientists using personal and social data, also through infrastructures, have the responsibility to get acquainted with the fundamental ethical aspects relating to becoming a “data controller.” This aspect has to be considered to define courses for informing and training data scientists about the responsibilities, the possibilities, and the boundaries they have in data manipulation.

Recalling Fig. 2, it is crucial to inject into the data science pipeline the ethical values of fairness: how to avoid unfair and discriminatory decisions; accuracy: how to provide reliable information; confidentiality: how to protect the privacy of the involved people; and transparency: how to make models and decisions comprehensible to all stakeholders. This value-sensitive design has to be aimed at boosting widespread social acceptance of data science, without inhibiting its power. Finally, it is essential to consider also the impact of the General Data Protection Regulation (GDPR) on (i) companies’ duties and how these European companies should comply with the limits in data manipulation the Regulation requires; and on (ii) researchers’ duties, and to highlight articles and recitals which specifically mention and explain how research is intended in GDPR’s legal system.
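
The call-record example above can be made concrete with a small sketch of one possible safeguard: aggregating records into origin-destination flows and suppressing flows shared by fewer than `k` distinct users. This threshold rule is a common disclosure-control heuristic that we substitute here for illustration; it is not a method named in this paper, it does not by itself guarantee anonymity, and the records, field layout, and function name are hypothetical.

```python
# Hypothetical call records: (pseudonymous user, origin cell, destination cell).
CALLS = [
    ("u1", "A", "B"), ("u2", "A", "B"), ("u3", "A", "B"), ("u1", "A", "B"),
    ("u4", "B", "C"),
    ("u1", "A", "C"), ("u2", "A", "C"),
]

def od_flows(records, k=3):
    """Count distinct users per origin-destination pair, dropping pairs with fewer than k users."""
    users_per_pair = {}
    for user, origin, dest in records:
        users_per_pair.setdefault((origin, dest), set()).add(user)
    return {
        pair: len(users)
        for pair, users in users_per_pair.items()
        if len(users) >= k  # small groups are suppressed rather than published
    }

flows = od_flows(CALLS)             # only flows shared by at least 3 users
all_flows = od_flows(CALLS, k=1)    # no suppression, for comparison
```

Counting distinct users rather than raw calls keeps one very active person from pushing a flow over the threshold, which is the situation the suppression rule is meant to catch.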

Fig. 3 The relationship between big and open data and how they relate to the broad concept of open government

We complete this section with another important aspect related to open data, i.e., accessible public data that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems. All the definitions of open data include two features: (i) the data must be publicly available for anyone to use, and (ii) data must be licensed in a way that allows for its reuse. All over the world, initiatives are launched to make data open by government agencies and public organizations; listing them is impossible, but a UN initiative has to be mentioned. Global Pulse is meant to implement the vision for a future in which big data is harnessed safely and responsibly as a public good.

Figure 3 shows the relationships between open data and big data. Currently, the problem is not only that government agencies (and some business companies) are collecting personal data about us, but also that we do not know what data are being collected and we do not have access to the information about ourselves. As reported by the World Economic Forum in 2013, it is crucial to understand the value of personal data to let the users make informed decisions. A new branch of philosophy and ethics is emerging to handle personal data related issues. On the one hand, in all cases where the data might be used for the social good (e.g., medical research, improvement of public transports, countering epidemics), understanding the value of personal data means correctly evaluating the balance between public benefits and personal loss of protection. On the other hand, when data are aimed to be used for commercial purposes, the value mentioned above might instead translate into simple pricing of personal information that the user might sell to a company for its business. In this context, discrimination discovery consists of searching for a-priori unknown contexts of suspect discrimination against protected-by-law social groups, by analyzing datasets of historical decision records. Machine learning and data mining approaches may be affected by discrimination rules, and these rules may be deeply hidden within obscure artificial intelligence models. Thus, discrimination discovery consists of understanding whether a predictive model makes direct or indirect discrimination. DCube [43] is a tool for data-driven discrimination discovery, a library of methods on fairness analysis.

It is important to evaluate how a mining model or algorithm takes its decision. The growing field of methods for explainable machine learning provides and continuously expands a set of comprehensive tool-kits [21]. For example, X-Lib is a library containing state-of-the-art explanation methods organized within a hierarchical structure and wrapped in a uniform way such that they can be easily accessed and used by different users. The library provides support for explaining classification on tabular data and images and for explaining the logic of complex decision systems. X-Lib collects, among the others, the following collection of explanation methods: LIME [38], Anchor [39],

DeepExplain that includes Saliency maps [44], Gradient * Input, Integrated Gradients, and DeepLIFT [46]. Saliency method is a library containing code for SmoothGrad [45], as well as implementations of several other saliency techniques: Vanilla Gradients, Guided Backpropagation, and Grad-CAM. Another improvement in this context is the use of robotics and AI in data preparation, curation, and in detecting bias in data, information and knowledge as well as in the misuse and abuse of these assets when it comes to legal, privacy, and ethical issues and when it comes to transparency and trust. We cannot rely on human beings to do these tasks. We need to exploit the power of robotics and AI to help provide the protections required. Data and information lawyers will play a key role in legal and privacy issues, ethical use of these assets, and the problem of bias in both algorithms and the data, information, and knowledge used to develop analytics solutions. Finally, we can state that data science can help to fill the gap between legislators and technology.

4 Big data ecosystem: the role of research infrastructures

Research infrastructures (RIs) play a crucial role in the advent and development of data science. A social mining experiment exploits the main components of data science depicted in Fig. 1 (i.e., data, infrastructures, analytical methods) to enable multidisciplinary scientists and innovators to extract knowledge and to make the experiment reusable by the scientific community, innovators providing an impact on science and society.

Resources such as data and methods help domain and data scientists to transform research or an innovation question into a responsible data-driven analytical process. This process is executed onto the platform, thus supporting experiments that yield scientific output, policy recommendations, or innova- [...] nities (and potential stakeholders) by activating appropriate dissemination channels.

4.1 The SoBigData Research Infrastructure

The SoBigData Research Infrastructure7 is an ecosystem of human and digital resources, comprising data scientists, analytics, and processes. As shown in Fig. 4, SoBigData is designed to enable multidisciplinary scientists and innovators to realize social mining experiments and to make them reusable by the scientific communities. All the components have been introduced for implementing data science from raw data management to knowledge extraction, with particular attention to legal and ethical aspects as reported in Fig. 1. SoBigData supports data science serving a cross-disciplinary community of data scientists studying all the elements of societal complexity from a data- and model-driven perspective.

Currently, SoBigData includes scientific, industrial, and other stakeholders. In particular, our stakeholders are data analysts and researchers (35.6%), followed by companies (33.3%) and policy and lawmakers (20%). The following sections provide a short but comprehensive overview of the services provided by SoBigData RI with special attention on supporting ethical and open data science [15,16].

4.1.1 Resources, facilities, and access opportunities

Over the past decade, Europe has developed world-leading expertise in building and operating e-infrastructures. They are large-scale, federated and distributed online research environments through which researchers can share access to scientific resources (including data, instruments, computing, and communications), regardless of their location. They are meant to support unprecedented scales of international collaboration in science, both within and across disciplines, investing in economy-of-scale and common behavior, poli-
tive proofs-of-concept. Furthermore, an operational ethical cies, best practices, and standards. They shape up a common
board’s stewardship is a critical factor in the success of a RI. environment where scientists can create, validate, assess,
An infrastructure typically offers easy-to-use means to compare, and share their digital results of science, such as
define complex analytical processes and workflows, thus research data and research methods, by using a common “dig-
bridging the gap between domain experts and analytical tech- ital laboratory” consisting of agreed-on services and tools.
nology. In many instances, domain experts may become a However, the implementation of workflows, possibly fol-
reference for their scientific communities, thus facilitating lowing Open Science principles of reproducibility and trans-
new users engagement within the RI activities. As a collat- parency, is hindered by a multitude of real-world problems.
eral feedback effect, experiments will generate new relevant One of the most prominent is that e-infrastructures available
data, methods, and workflows that can be integrated into to research communities today are far from being well-
the platform by data scientists, contributing to the resource designed and consistent digital laboratories, neatly designed
expansion of the RI. An experiment designed in a node of to share and reuse resources according to common policies,
the RI and executed on the platform returns its results to the data models, standards, language platforms, and APIs. They
entire RI community. are instead “patchworks of systems,” assembling online tools,
Well defined thematic environments amplify new experi-
ments achievements towards the vertical scientific commu- 7

Fig. 4 The SoBigData Research Infrastructure: an ecosystem of human and digital resources, comprising data scientists, analytical methods, and
processes. SoBigData enables multidisciplinary scientists and innovators to carry out experiments and to make them reusable by the community

services, and data sources and evolving to match the requirements of the scientific process, to include new solutions. The degree of heterogeneity excludes the adoption of uniform workflow management systems, standard service-oriented approaches, routine monitoring and accounting methods. Scientific workflows are typically realized by writing ad hoc code, manipulating data on desktops, alternating the execution of online web services, sharing software libraries implementing research methods in different languages, desktop tools, web-accessible execution engines (e.g., Taverna, Knime, Galaxy).

The SoBigData e-infrastructure is based on D4Science services, which provide researchers and practitioners with a working environment where open science practices are transparently promoted, and data science practices can be implemented by minimizing the technological integration cost highlighted above.

D4Science is a deployed instance of the gCube technology [4], software conceived to facilitate the integration of web services, code, and applications as resources of different types in a common framework, which in turn enables the construction of Virtual Research Environments (VREs) [7] as combinations of such resources (Fig. 5). As there is no common framework that can be trusted enough, sustained enough, to convince resource providers that converging to it would be a worthwhile effort, D4Science implements a “system of systems.” In such a framework, resources are integrated with minimal cost, to gain in scalability, performance, accounting, provenance tracking, seamless integration with other resources, visibility to all scientists. The principle is that the cost of “participation” in the framework is on the infrastructure rather than on resource providers. The infrastructure provides the necessary bridges to include and combine resources that would otherwise be incompatible.
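
To make the “system of systems” principle more concrete, the following is a minimal illustrative sketch in Python. It is not the gCube or D4Science API: every name in it (Resource, WebServiceAdapter, ScriptAdapter, Registry) is a hypothetical placeholder introduced for this example. Under these assumptions, it shows how infrastructure-side adapters can expose heterogeneous resources through one common interface, so that the integration cost falls on the infrastructure rather than on the resource providers, while a simple provenance log is kept for every execution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class Resource(Protocol):
    """Common interface assumed by the (hypothetical) infrastructure."""
    name: str

    def run(self, payload: dict) -> dict: ...


@dataclass
class WebServiceAdapter:
    """Wraps a provider's web service; `call` stands in for an HTTP client."""
    name: str
    call: Callable[[dict], dict]

    def run(self, payload: dict) -> dict:
        return self.call(payload)


@dataclass
class ScriptAdapter:
    """Wraps a plain function that takes keyword arguments."""
    name: str
    func: Callable[..., object]

    def run(self, payload: dict) -> dict:
        return {"result": self.func(**payload)}


@dataclass
class Registry:
    """Infrastructure-side registry that also records a provenance log."""
    resources: Dict[str, Resource] = field(default_factory=dict)
    provenance: List[str] = field(default_factory=list)

    def register(self, resource: Resource) -> None:
        self.resources[resource.name] = resource

    def execute(self, name: str, payload: dict) -> dict:
        # Record which resource ran and with which input names.
        self.provenance.append(f"{name} <- {sorted(payload)}")
        return self.resources[name].run(payload)


registry = Registry()
registry.register(WebServiceAdapter("echo-service", lambda p: {"echo": p}))
registry.register(ScriptAdapter("mean", lambda values: sum(values) / len(values)))

print(registry.execute("mean", {"values": [1, 2, 3]}))
```

In this sketch the providers of the two resources do not change their code; only the adapters, which belong to the infrastructure, know about the common interface.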


Fig. 5 D4Science: resources from external systems, virtual research environments, and
More specifically, via D4Science, SoBigData scientists can integrate and share resources such as datasets, research methods, web services via APIs, and web applications via Portlets. Resources can then be integrated, combined, and accessed via VREs, intended as web-based working environments tailored to support the needs of their designated communities, each working on a research question. Research methods are integrated as executable code, implementing WPS APIs in different programming languages (e.g., Java, Python, R, Knime, Galaxy), which can be executed via the Data Miner analytics platform in parallel, transparently to the users, over powerful and extensible clusters, and via simple VRE user interfaces. Scientists using Data Miner in the context of a VRE can select and execute the available methods and share the results with other scientists, who can repeat or reproduce the experiment with a simple click.

D4Science VREs are equipped with core services supporting data analysis and collaboration among their users: (i) a shared workspace to store and organize any version of a research artifact; (ii) a social networking area to have discussions on any topic (including working versions and released artifacts) and be informed about happenings; (iii) a Data Miner analytics platform to execute processing tasks (research methods) either natively provided by VRE users or borrowed from other VREs to be applied to VRE users’ cases and datasets; and (iv) a catalogue-based publishing platform to make the existence of a certain artifact public and disseminated. Scientists operating within VREs use such facilities continuously, and the facilities transparently track the record of their research activities (actions, authorship, provenance), as well as products and links between them (lineage) resulting from every phase of the research life cycle, thus facilitating publishing of science according to Open Science principles of transparency and reproducibility [5].

Today, SoBigData integrates the resources in Table 1. By means of such resources, SoBigData scientists have created VREs to deliver the so-called SoBigData exploratories: Explainable Machine Learning, Sports Data Science, Migration Studies, Societal Debates, Well-being & Economy, and City of Citizens. Each exploratory includes the resources required to perform data science workflows in a controlled and shared environment. Resources range from data to methods, described in more detail in the following, together with their exploitation within the exploratories.

All the resources and instruments integrated into the SoBigData RI are structured in such a way as to operate within the confines of the current data protection law, with a focus on the General Data Protection Regulation (GDPR) and on the ethical analysis of the fundamental values involved in social mining and AI. Each item in the catalogue has specific fields for managing ethical issues (e.g., if a dataset contains personal info) and fields for describing and managing intellectual property.

4.1.2 Data resources: social mining and big data ecosystem

SoBigData RI defines policies supporting users in the collection, description, preservation, and sharing of their data sets.
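
As an illustration of the catalogue fields mentioned in Sect. 4.1.1, a catalogue item carrying ethical and intellectual-property metadata could be modelled along the following lines. This is only a sketch under stated assumptions: the field names (contains_personal_data, license, access), the example items, and the access rule are hypothetical and do not reproduce the actual SoBigData catalogue schema or its policies.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"              # shared with the scientific community at large
    RESTRICTED = "restricted"  # usable only inside a secure environment


@dataclass(frozen=True)
class CatalogueItem:
    title: str
    kind: str                     # e.g. "dataset" or "method"
    contains_personal_data: bool  # hypothetical ethical-issue field
    license: str                  # hypothetical intellectual-property field
    access: Access


def can_download(item: CatalogueItem) -> bool:
    """Assumed rule: only open items without personal data leave the platform."""
    return item.access is Access.OPEN and not item.contains_personal_data


# Invented example entries, used only to exercise the rule above.
items = [
    CatalogueItem("City trajectories", "dataset", True, "custom", Access.RESTRICTED),
    CatalogueItem("Public news corpus", "dataset", False, "CC-BY-4.0", Access.OPEN),
]

for item in items:
    mode = "download" if can_download(item) else "secure environment only"
    print(item.title, "->", mode)
```

Keeping the ethical and licensing information as explicit fields, rather than free text, is what allows a sharing policy to be checked automatically before a resource is released.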

Table 1 SoBigData resources (Jul 2020)

Type                  Number
Datasets              91
Methods               83
Web applications      9
Training material     22
Executable methods    68

It implements data science by making such data available for collaborative research through various strategies, ranging from sharing open data sets with the scientific community at large to sharing data under disclosure restrictions, allowing data access within secure environments.

Several big data sets are available through SoBigData RI, including network graphs from mobile phone call data; networks crawled from many online social networks, including Facebook and Flickr; transaction micro-data from diverse retailers; query logs both from search engines and e-commerce; society-wide mobile phone call data records; GPS tracks from personal navigation devices; survey data about customer satisfaction or market research; extensive web archives; billions of tweets; and data from location-aware social networks.

4.1.3 Data science through SoBigData exploratories

Exploratories are thematic environments built on top of the SoBigData RI. An exploratory binds datasets with social mining methods, providing the research context for supporting specific data science applications by: (i) providing the scientific context for performing the application (this context can be considered a container for binding specific methods, applications, services, and datasets); (ii) stimulating communities on the effectiveness of the analytical process related to the analysis, promoting scientific dissemination, result sharing, and reproducibility. The use of exploratories promotes the effectiveness of data science through research infrastructure services. The following sections report a short description of the six SoBigData exploratories. Figure 6 shows the main thematic areas covered by each exploratory. Due to its nature, the Explainable Machine Learning exploratory can be applied to each sector where a black-box machine learning approach is used. The list of exploratories (and the data and methods inside them) is updated continuously and continues to grow over time.

City of citizens. This exploratory aims to collect data science applications and methods related to geo-referenced data. The latter describes the movements of citizens in a city, a territory, or an entire region. There are several studies and different methods that employ a wide variety of data sources to build models about the mobility of people and city characteristics in the scientific literature [30,32]. Like ecosystems, cities are open systems that live and develop utilizing flows of energy, matter, and information. What distinguishes a city from a colony is the human component (i.e., the process of transformation by cultural and technological evolution). Through this combination, cities are evolutionary systems that develop and co-evolve continuously with their inhabitants [24]. Cities are kaleidoscopes of information generated by a myriad of digital devices woven into the urban fabric. The inclusion of tracking technologies in personal devices enabled the analysis of large sets of mobility data like GPS traces and call detail records.

Data science applied to human mobility is one of the critical topics investigated in SoBigData thanks to the decade-long experience of partners in European projects. The study of human mobility led to the integration into SoBigData of unique Global Positioning System (GPS) and call detail record (CDR) datasets of people and vehicle movements, and geo-referenced social network data as well as several mobility services: O/D (origin-destination) matrix computation, Urban Mobility Atlas (a visual interface to city mobility patterns), GeoTopics (for exploring patterns of urban activity from Foursquare), and predictive models: MyWay (trajectory prediction), TripBuilder (helping tourists to build personalized tours of a city). In human mobility, research questions come from geographers, urbanists, complexity scientists, data scientists, policymakers, and Big Data providers, as well as innovators aiming to provide applications for any service for the smart city ecosystem. The idea is to investigate the impact of political events on the well-being of citizens. This exploratory supports the development of “happiness” and “peace” indicators through a text mining/opinion mining pipeline on repositories of online news. These indicators reveal that the level of crime of a territory can be well approximated by analyzing the news related to that territory. Generally, we study the impact of the economy on well-being and vice versa, e.g., also considering that the propagation of shocks of financial distress in an economic or financial system crucially depends on the topology of the network interconnecting the different elements.

Well-being and economy. This exploratory tests the hypothesis that well-being is correlated with the business performance of companies. The idea is to combine statistical methods and

Fig. 6 SoBigData covers six thematic areas listed horizontally. Each exploratory covers more than one thematic

traditional economic data (typically low-frequency) with high-frequency data from non-traditional sources, such as the web and supermarkets, for nowcasting economic, socioeconomic and well-being indicators. These indicators allow us to study and measure real-life costs by studying price variation and socioeconomic status inference. Furthermore, this activity supports studies on the correlation between people’s well-being and their social and mobility data. In this context, some basic hypotheses can be summarized as: (i) there are curves of age- and gender-based segregation distribution in boards of companies, which are characteristic of the mean credit risk of companies in a region; (ii) low mean credit risk of companies in a region has a positive correlation with well-being; (iii) systemic risk correlates highly with well-being indices at a national level. The final aim is to provide national governments with a set of guidelines, methods, and indices for decision making on regulations affecting companies to improve well-being in the country, also considering effective policies to reduce operational risks such as credit risk, and external threats of companies [17].

Big Data, analyzed through the lens of data science, provides means to understand our complex socioeconomic and financial systems. On the one hand, this offers new opportunities to measure the patterns of well-being and poverty at a local and global scale, empowering governments and policymakers with the unprecedented opportunity to nowcast relevant economic quantities and compare different countries, regions, and cities. On the other hand, this allows us to investigate the network underlying the complex systems of economy and finance, and how it affects the aggregate output, the propagation of shocks or financial distress, and systemic risk.

Societal debates. This exploratory employs data science approaches to answer research questions such as: who is participating in public debates? What is the “big picture” response from citizens to a policy, election, referendum, or other political events? This kind of analysis allows scientists, policymakers, and citizens to understand the online discussion surrounding polarized debates [14]. The personal perception of online discussions on social media is often biased by the so-called filter bubble, in which automatic curation of content and relationships between users negatively affects the diversity of opinions available to them. Making a complete analysis of online polarized debates enables the citizens to be better informed and prepared for political outcomes. By analyzing content and conversations on social media and newspaper articles, data scientists study public debates and also assess public sentiment around debated topics, opinion diffusion dynamics, echo chamber formation and polarized discussions, fake news analysis, and propaganda bots. Misinformation is often the result of a distorted

perception of concepts that, although unrelated, suddenly appear together in the same narrative. Understanding the details of this process at an early stage may help to prevent the birth and the diffusion of fake news. The misinformation fight includes the development of dynamical models of misinformation diffusion (possibly in contrast to the spread of mainstream news) as well as models of how attention cycles are accelerated and amplified by the infrastructures of online media.

Another important topic covered by this exploratory concerns the analysis of how social bot activity affects fake news diffusion. Determining whether a human or a bot controls a user account is a complex task. To the best of our knowledge, the only openly accessible solution to detect social bots is Botometer, an API that allows us to interact with an underlying machine learning system. Although Botometer has been proven to be entirely accurate in detecting social bots, it has limitations due to the Twitter API features: hence, an algorithm overcoming the barriers of current recipes is needed.

The resources related to the Societal Debates exploratory, especially in the domain of media ecology and the fight against misinformation online, provide easy-to-use services to public bodies, media outlets, and social/political scientists. Furthermore, SoBigData supports new simulation models and experimental processes to validate in vivo the algorithms for fighting misinformation, curbing the pathological acceleration and amplification of online attention cycles, breaking the bubbles, and exploring alternative media and information ecosystems.

Migration studies. Data science is also useful to understand the migration phenomenon. Knowledge about the number of immigrants living in a particular region is crucial to devise policies that maximize the benefits for both locals and immigrants. These numbers can vary rapidly in space and time, especially in periods of crisis such as wars or natural disasters.

This exploratory provides a set of data and tools for trying to answer some questions about migration flows. Through this exploratory, a data scientist studies economic models of migration and can observe how migrants choose their destination countries. A scientist can discover the meaning of the “opportunities” that a country provides to migrants, and whether there are correlations between the number of incoming migrants and opportunities in the host countries [8]. Furthermore, this exploratory tries to understand how public perception of migration is changing using opinion mining analysis. For example, social network analysis enables us to analyze the migrant’s social network and discover the structure of the social network for people who decided to start a new life in a different country [28].

Finally, we can also evaluate current integration indices based on official statistics and survey data, which can be complemented by Big Data sources. This exploratory aims to build combined integration indices that take into account multiple data sources to evaluate integration on various levels. Such sources include mobile phone data to understand patterns of communication between immigrants and natives; social network data to assess sentiment towards immigrants and immigration; professional network data (such as LinkedIn) to understand labor market integration; and local data to understand to what extent moving across borders is associated with a change in the cultural norms of the migrants. These indices are fundamental to evaluate the overall social and economic effects of immigration. The new integration indices can be applied with various space and time resolutions (small area methods) to obtain a complete image of integration, and complement official indices.

Sports data science. The proliferation of new sensing technologies that provide high-fidelity data streams extracted from every game is changing the way scientists, fans and practitioners conceive sports performance. The combination of these (big) data with the tools of data science provides the possibility to unveil complex models underlying sports performance and makes it possible to perform many challenging tasks, from automatic tactical analysis to data-driven performance ranking, game outcome prediction, and injury forecasting. The idea is to foster research on sports data science in several directions. The application of explainable AI and deep learning techniques can be hugely beneficial to sports data science. For example, by using adversarial learning, we can modify the training plans of players who are associated with high injury risk and develop training plans that maximize the fitness of players (minimizing their injury risk). The use of gaming, simulation, and modeling is another set of tools that can be used by coaching staff to test tactics that can be employed against a competitor. Furthermore, by using deep learning on time series, we can forecast the evolution of the performance of players and search for young talents.

This exploratory examines the factors influencing sports success and how to build simulation tools for boosting both individual and collective performance. Furthermore, this exploratory describes performances employing data, statistics, and models, allowing coaches, fans, and practitioners to understand (and boost) sports performance [42].

Explainable machine learning. Artificial Intelligence, increasingly based on Big Data analytics, is a disruptive technology of our times. This exploratory provides a forum for studying the effects of AI on future society. In this context, SoBigData studies the future of labor and the workforce, also through data- and model-driven analysis, simulations, and the development of methods that construct human-understandable explanations of AI black-box models [20].

Black box systems for automated decision making map a user’s features into a class that predicts the behavioral traits of individuals, such as credit risk or health status, without exposing the reasons why. Most of the time, the internal reasoning of these algorithms is obscure even to their developers. For this reason, the last decade has witnessed the rise of a black box society. This exploratory is developing a set of techniques and tools which allow data analysts to understand why an algorithm produces a decision. These approaches are designed not only for discovering a lack of transparency but also for discovering possible biases inherited by the algorithms from human prejudices and artefacts hidden in the training data (which may lead to unfair or wrong decisions) [35].

5 Conclusions: individual and collective intelligence

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [23]. Since 2012, 2.5 exabytes (2.5 × 10^18 bytes) of data have been created every day; as of 2014, 2.3 zettabytes (2.3 × 10^21 bytes) of data were generated every day by super-power high-tech corporations worldwide. Soon zettabytes of useful public and private data will be widely and openly available. In the next years, smart applications such as smart grids, smart logistics, smart factories, and smart cities will be widely deployed across the continent and beyond. Ubiquitous broadband access, mobile technology, social media, services, and the Internet of Things on billions of devices will have contributed to the explosion of generated data to a total global estimate of 40 zettabytes.

In this work, we have introduced data science as a new challenge and opportunity for the next years. In this context, we have tried to summarize in a concise way several aspects related to data science applications and their impacts on society, considering both the new services available and the new job perspectives. We have also introduced issues in managing data representing human behavior and showed how difficult it is to preserve personal information and privacy. With the introduction of SoBigData RI and exploratories, we have provided virtual environments where it is possible to understand the potential of data science in different research contexts.

Concluding, we can state that social dilemmas occur when there is a conflict between the individual and public interest. Such problems also appear in the ecosystem of distributed AI systems (based on data science tools) and humans, with additional difficulties due, on the one hand, to the relative rigidity of the trained AI systems and the necessity of achieving social benefit, and, on the other hand, to the necessity of keeping individuals interested. What are the principles and solutions [...] to another. Every AI system should operate within an ethical and social framework in an understandable, verifiable, and justifiable way. Such systems must, in any case, work within the bounds of the rule of law, incorporating protection of fundamental rights into the AI infrastructure. In other words, the challenge is to develop mechanisms that will result in the system converging to an equilibrium that complies with European values and social objectives (e.g., social inclusion) but without unnecessary losses of efficiency.

Interestingly, data science can play a vital role in enhancing desirable behaviors in the system, e.g., by supporting coordination and cooperation that is, more often than not, crucial to achieving any meaningful improvements. Our ultimate goal is to build the blueprint of a sociotechnical system in which AI not only cooperates with humans but, if necessary, helps them to learn how to collaborate, as well as other desirable behaviors. In this context, it is also essential to understand how to achieve robustness of the human and AI ecosystems with respect to various types of malicious behaviors, such as abuse of power and exploitation of AI technical weaknesses.

We conclude by paraphrasing Stephen Hawking in his Brief Answers to the Big Questions: the availability of data on its own will not take humanity to the future, but its intelligent and creative use will.

Acknowledgements This work is supported by the European Community’s H2020 Program under the scheme ‘INFRAIA-1-2014-2015: Research Infrastructures’, grant agreement #654024 ‘SoBigData: Social Mining and Big Data Ecosystem’ and the scheme ‘INFRAIA-01-2018-2019: Research and Innovation action’, grant agreement #871042 ‘SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics’.

Funding Information Open access funding provided by Università di Pisa within the CRUI-CARE Agreement.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm

