Particle Swarm Optimization is an evolutionary method inspired by the social behaviour of individuals inside swarms in nature. Solutions of the problem are modelled as members of the swarm, which fly through the solution space. Evolution is obtained from the continuous movement of the particles that constitute the swarm, subjected to the effect of inertia and the attraction of the members who lead the swarm. This work focuses on a recent Discrete Particle Swarm Optimization method for combinatorial optimization, called Jumping Particle Swarm Optimization. Its effectiveness is illustrated on the minimum labelling Steiner tree problem: given an undirected labelled connected graph, the aim is to find a spanning tree covering a given subset of nodes whose edges have the smallest number of distinct labels.
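The paper gives the algorithmic details; as a rough illustration of the idea, the sketch below implements a generic jumping-particle scheme over a set-based encoding. The attractor probabilities, the `evaluate` callback, and the label-set moves are illustrative assumptions, not the authors' exact operators; for the minimum labelling Steiner tree problem, `evaluate` would check that the chosen labels induce a subgraph connecting the required nodes and return the number of distinct labels.

```python
import random

def jumping_pso(evaluate, labels, n_particles=20, iterations=200, seed=0):
    """Hypothetical sketch of Jumping PSO with a set-based encoding.

    A position is a set of labels; evaluate(position) returns the cost to
    minimise (float('inf') for empty or infeasible label sets).  Particles
    carry no velocity: each move is a discrete "jump" that copies one label
    from an attractor (personal or global best) or performs a random flip.
    """
    rng = random.Random(seed)
    positions = [{rng.choice(labels)} for _ in range(n_particles)]
    pbest = [set(p) for p in positions]
    gbest = min(pbest, key=evaluate)

    for _ in range(iterations):
        for i, pos in enumerate(positions):
            r = rng.random()
            if r < 0.3:                      # random jump: flip one label
                pos.symmetric_difference_update({rng.choice(labels)})
            elif r < 0.65:                   # attraction to personal best
                pos.add(rng.choice(sorted(pbest[i])))
            else:                            # attraction to global best
                pos.add(rng.choice(sorted(gbest)))
            if evaluate(pos) < evaluate(pbest[i]):
                pbest[i] = set(pos)
                if evaluate(pos) < evaluate(gbest):
                    gbest = set(pos)
    return gbest
```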
Interactive Journal of Medical Research, Apr 5, 2019
The idea of artificial intelligence (AI) has a long history. It turned out, however, that reaching intelligence at human levels is more complicated than originally anticipated. Currently, we are experiencing renewed interest in AI, fueled by an enormous increase in computing power and an even larger increase in data, in combination with improved AI technologies like deep learning. Healthcare is considered the next domain to be revolutionized by artificial intelligence. While AI approaches are excellently suited to developing certain algorithms, there are specific challenges for biomedical applications. We propose six recommendations (the 6Rs) to improve AI projects in the biomedical space, especially clinical health care, and to facilitate communication between AI scientists and medical doctors: (1) Relevant and well-defined clinical question first; (2) Right data (i.e., representative and of good quality); (3) Ratio between the number of patients and their variables should fit the AI method; (4) Relationship between data and ground truth should be as direct and causal as possible; (5) Regulatory ready, enabling validation; and (6) Right AI method.
The 5th International Conference on Future Networks & Distributed Systems, Dec 15, 2021
Research on automatic price prediction in stock markets indicates that published news is an important asset for this problem. We elaborate on an NLP-based approach to generate industry-specific lexicons from news documents exploiting the distributed technology of Apache Spark, with a focus on identifying, on a day-to-day scale, the correlation between significant stock price variations and the words collected from press releases. We then apply a binary classification algorithm that builds upon our newly generated lexicons to predict the magnitude of fluctuation of stock market prices. Subsequently, by processing news belonging to a large collection of articles from the most prestigious press agencies, we validate our approach with an experiment on the market history of the US companies in the Standard & Poor's 500 index. We also test the performance of the algorithm in a multilingual setting, focusing in particular on the Italian stock market and the Italy 40 (FTSE MIB) index. The final classification results let us assess the mutual dependence between terms and prices, and help us evaluate the predictive power of the created lexicons.
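As a toy illustration of the lexicon-generation idea (without Spark), the sketch below scores words by how often they co-occur with significant daily price moves and uses the aggregate score as a binary direction signal. The input formats and the 2% move threshold are assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def build_lexicon(daily_news, daily_moves, threshold=0.02):
    """Score words by co-occurrence with significant price variations.

    daily_news:  list of token lists, one per trading day (hypothetical input)
    daily_moves: list of daily returns aligned with daily_news
    Returns a word -> score map in [-1, 1]: positive means the word tends
    to appear on days with large upward moves, negative with downward ones.
    """
    up, down = Counter(), Counter()
    for tokens, move in zip(daily_news, daily_moves):
        if move > threshold:
            up.update(set(tokens))
        elif move < -threshold:
            down.update(set(tokens))
    vocab = set(up) | set(down)
    return {w: (up[w] - down[w]) / (up[w] + down[w]) for w in vocab}

def predict_direction(tokens, lexicon):
    """Classify a day's news stream by its aggregate lexicon score."""
    score = sum(lexicon.get(t, 0.0) for t in tokens)
    return "up" if score > 0 else "down"
```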
Modern financial markets produce massive datasets that need to be analysed using new modelling techniques like those from (deep) Machine Learning and Artificial Intelligence. The common goal of these techniques is to forecast the behaviour of the market, which can be translated into various classification tasks, such as predicting the likelihood of companies' bankruptcy or detecting fraud. However, real-world financial data are often unbalanced, meaning that the class distribution is not equally represented in such datasets. This is a major issue, since any Machine Learning model is then trained mainly on the majority class, leading to inaccurate predictions. In this paper, we explore different data augmentation techniques to deal with very unbalanced financial data. We consider a number of publicly available datasets, apply state-of-the-art augmentation strategies to them, and evaluate the results for several Machine Learning models trained on the sampled data. The performance of the various approaches is evaluated according to accuracy, micro and macro F1 scores, and by analysing the precision and recall over the minority class. We show that a consistent and accurate improvement is achieved when data augmentation is employed. The obtained classification results look promising and indicate the effectiveness of augmentation strategies on financial tasks. On the basis of these results, we present an approach, focused on classification tasks within the financial domain, that takes a dataset as input, identifies what kind of augmentation technique to use, and then applies an ensemble of all the augmentation techniques of the identified type to the input dataset, along with an ensemble of different methods to tackle the underlying classification.
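A minimal sketch of such an evaluation loop is given below, using scikit-learn and imbalanced-learn with a synthetic unbalanced dataset as a stand-in for the financial data; the samplers and model shown are common choices, not necessarily the exact strategies compared in the paper.

```python
# Compare no resampling, random oversampling, and SMOTE on a 1%-minority
# synthetic dataset, reporting macro-F1 and minority-class recall.
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("none", None),
                      ("random", RandomOverSampler(random_state=0)),
                      ("smote", SMOTE(random_state=0))]:
    X_res, y_res = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    model = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    pred = model.predict(X_te)
    print(name,
          "macro-F1:", round(f1_score(y_te, pred, average="macro"), 3),
          "minority recall:", round(recall_score(y_te, pred), 3))
```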
Links in webpages carry an intended semantics: usually, they indicate a relation between two things, a subject (something referred to within the web page) and an object (the target webpage of the link, or something referred to within it). We designed and implemented a novel system, named Legalo, which uncovers the intended semantics of links by defining Semantic Web properties that capture their meaning. Legalo properties can be used for tagging links with semantic relations. The system is available at http://wit.istc.cnr.it/stlab-tools/legalo.
In application fields such as linguistics and computer vision there is an increasing need for reference data for the empirical analysis of new methods and the assessment of different algorithms. Current evaluations are based on a few real-life collections or on artificial data generators built on models that are too simplistic to cover real scenarios and to allow researchers to identify crucial limitations of their algorithms. We propose a flexible approach to generate high-dimensional vectors with directional properties controlled by the distribution of their pairwise cosine distances. The generation method is formulated as a non-linear continuous optimization problem, which is solved with a computationally efficient local search algorithm. We show with an empirical study that our approach can create large high-dimensional data collections with desired properties in reasonable time.
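The sketch below illustrates the flavour of the approach with a deliberately simplified objective: unit vectors are perturbed one at a time by a local search that drives all pairwise cosine similarities towards a single target value (the published method controls a full distribution of pairwise distances, not just one target).

```python
import numpy as np

def generate_vectors(n, d, target_cos, iters=2000, step=0.05, seed=0):
    """Local-search sketch: n unit vectors in R^d whose pairwise cosine
    similarities approach a common target value target_cos."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    def objective(V):
        G = V @ V.T                      # Gram matrix = cosine similarities
        off = G[~np.eye(n, dtype=bool)]  # off-diagonal entries only
        return np.mean((off - target_cos) ** 2)

    best = objective(V)
    for _ in range(iters):
        i = rng.integers(n)              # perturb one vector at a time
        cand = V.copy()
        cand[i] += step * rng.normal(size=d)
        cand[i] /= np.linalg.norm(cand[i])
        val = objective(cand)
        if val < best:                   # accept improving moves only
            V, best = cand, val
    return V, best

V, err = generate_vectors(n=50, d=20, target_cos=0.3)
print("mean squared deviation from target:", err)
```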
Neither the European Commission nor any person acting on behalf of the Commission is responsible for the use which might be made of this publication. Europe Direct is a service to help you find answers to your questions about the European Union. Freephone number (*): 00 800 6 7 8 9 10 11. (*) Certain mobile telephone operators do not allow access to 00 800 numbers, or these calls may be billed. A great deal of additional information on the European Union is available on the Internet.
Nowadays, there are plenty of text documents in different domains whose unstructured content makes them hard to analyze automatically. In the medical domain this problem is even more pronounced and is earning more and more attention. Medical reports may contain relevant information that can be employed, among many useful applications, to build predictive systems able to classify new medical cases, thus supporting physicians in taking more correct and reliable actions about diagnoses and care. It is generally hard and time-consuming to infer information for comparing unstructured data and evaluating similarities between various resources. In this work we show how it is possible to cluster medical reports, based on features detected from a collection of text documents by using two emerging tools, IBM Watson and Framester. Experiments and results have proved the quality of the resulting clusterings and the key role that these services can play.
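To make the pipeline concrete, the sketch below clusters a handful of illustrative report snippets; plain TF-IDF vectors stand in for the IBM Watson and Framester features, since the clustering step is agnostic to how the features were produced.

```python
# Hedged sketch: TF-IDF features replace the Watson/Framester extraction;
# the snippets are invented examples, not real medical data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reports = [
    "patient presents persistent cough and mild fever",
    "chest x-ray shows infiltrate consistent with pneumonia",
    "fasting glucose elevated, suspected type 2 diabetes",
    "patient reports increased thirst and frequent urination",
]

features = TfidfVectorizer(stop_words="english").fit_transform(reports)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
for report, cluster in zip(reports, labels):
    print(cluster, report)
```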
Linked Open Data (LOD) is reaching significant adoption in Public Administrations (PAs), where it is often required to be connected to existing platforms, such as GIS-based data management. Building on previous experience with the pioneering data.cnr.it, through Semantic Scout, as well as the Italian DigitPA agency recommendations for LOD in the Italian PA, we are working on the extraction, publication, and exploitation of data from the Geographic Information System of the Municipality of Catania, referred to as SIT ("Sistema Informativo Territoriale"). This paper describes the results and lessons learnt from the first campaign, aimed at analysing, reengineering, linking, and formalizing the Shape-based geo-data from the SIT.
Intelligent Systems Reference Library, Jul 4, 2018
During the last decades, a huge amount of data has been collected in clinical databases in the form of medical reports, laboratory results, treatment plans, etc., representing patients' health status. Hence, the digital information available for patient-oriented decision making has increased drastically, but it is often not mined and analyzed in depth, since: (i) medical documents are often unstructured and therefore difficult to analyze automatically; (ii) doctors traditionally rely on their experience to recognize an illness, give a diagnosis, and prescribe medications. However, doctors' experience can be limited by the cases they have treated so far, and medication errors can occur frequently. In addition, it is generally hard and time-consuming to infer information for comparing unstructured data and evaluating similarities between heterogeneous resources. Technologies such as Data Mining, Natural Language Processing, and Machine Learning can provide possibilities to explore and exploit potential knowledge from diagnosis history records and help doctors prescribe medication correctly, effectively decreasing medication errors. In this paper, we design and implement a medical recommender system that is able to cluster a collection of medical reports on features detected by IBM Watson and Framester, two emerging tools from, respectively, Cognitive Computing and Frame Semantics, and then, given a medical report from a specific patient as input, to recommend other similar medical reports from patients who had analogous symptoms. Experiments and results have proved the quality of the resulting clustering and recommendations, and the key role that these innovative services can play in the biomedical sector.
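Once reports are embedded and clustered, the recommendation step reduces to a nearest-neighbour query. The sketch below shows one way to do it, restricting candidates to the query's cluster; the feature vectors are assumed to come from the upstream extraction (Watson/Framester in the paper, though any vectorizer works for the sketch).

```python
import numpy as np

def recommend(query_vec, vectors, clusters, query_cluster, k=3):
    """Return indices of the k reports most similar to the query by cosine
    similarity, restricted to the query's own cluster.

    vectors:  (n, d) numpy array of report embeddings
    clusters: length-n numpy array of cluster assignments
    """
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    sims = vectors @ query_vec / np.where(norms == 0, 1.0, norms)
    sims[clusters != query_cluster] = -np.inf   # stay within the cluster
    return np.argsort(sims)[::-1][:k]           # top-k most similar reports
```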
Chapter "Evaluating the Effects of Chaos in Variable Neighbourhood Search" was previously publish... more Chapter "Evaluating the Effects of Chaos in Variable Neighbourhood Search" was previously published non-open access. It has now been changed to open access under a CC BY 4.0 license and the copyright holder updated to 'The Author(s)'. The book has also been updated with this change.
In this paper, we propose an innovative tool able to enrich cultural and creative spots ("gems" hereinafter) extracted from the European Commission Cultural Gems portal by suggesting relevant keywords (tags) and YouTube videos (represented with proper thumbnails). On the one hand, the system queries the YouTube search portal, selects the videos most related to the given gem, and extracts a set of meaningful thumbnails for each video. On the other hand, each tag is selected by identifying semantically related popular search queries (i.e., trends); in particular, trends are retrieved by querying the Google Trends platform. A further novelty is that our system suggests contents in a dynamic way: since, for both the YouTube and Google Trends platforms, the results of a given query include the most popular videos/trends, a gem may constantly be updated with trendy content by periodically running the tool. The system has been tested on a set of gems and evaluated with the support of human annotators. The results highlighted the effectiveness of our proposal. INDEX TERMS: Computer science in cultural heritage; heterogeneous data analysis; modeling, interlinking, and browsing; semantic-aware representation of cultural data; machine learning; social media.
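As an illustration of the tag-suggestion step, the sketch below uses the unofficial pytrends client to fetch popular related queries for a gem name; the client, timeframe, and field names are assumptions rather than the tool's actual implementation.

```python
# Hedged sketch of tag suggestion via Google Trends (unofficial pytrends
# client; subject to the usual rate limits and API instability).
from pytrends.request import TrendReq

def suggest_tags(gem_name, max_tags=5):
    pytrends = TrendReq(hl="en-US")
    pytrends.build_payload([gem_name], timeframe="today 3-m")
    related = pytrends.related_queries().get(gem_name, {})
    top = related.get("top")            # DataFrame of popular related queries
    if top is None or top.empty:
        return []
    return top["query"].head(max_tags).tolist()

print(suggest_tags("Colosseum"))        # illustrative gem name
```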
1 ABOUT Given the increasing adoption of personal health services and devices, research on smart personal health interfaces is a hot topic for the AI and human-computer interaction communities [3, 10, 12]. The availability of conversational interfaces in our environment may lead to a revolution in home healthcare and health self-management. The conventional means for getting people engaged in changing their health behaviour have been health education and counselling services, which do not scale well to wide populations. The first wave of health solutions based on wearables and apps has not been shown to be sufficiently effective for behaviour change and health self-management [8, 18]. Counselling is still known to be the most effective intervention for lifestyle diseases.
The Global Data on Events, Location, and Tone (GDELT) is a real-time, large-scale database of global human society for open research, which monitors the world's broadcast, print, and web news, creating a free open platform for computing on the entire world's media. In this work, we first describe a data crawler, which collects metadata of the GDELT database in real time and stores them in a big data management system based on Elasticsearch, a popular and efficient search engine relying on the Lucene library. Then, by exploiting and engineering the detailed information on each news item encoded in GDELT, we build indicators capturing investors' emotions which are useful to analyse the sovereign bond market in Italy. Using regression analysis and exploiting the power of Gradient Boosting models from machine learning, we find that the features extracted from GDELT improve the forecast of the government yield spread, relative to a baseline regression where only conventional regressors are included. The improvement in the fit is particularly relevant during the government crisis of May-December 2018.
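A minimal sketch of the ingestion step is shown below: a tab-separated GDELT batch is fetched over HTTP and indexed into a local Elasticsearch instance. The URL, the column positions, and the document schema are illustrative assumptions, not the project's actual crawler.

```python
# Hedged sketch of GDELT ingestion into Elasticsearch (assumed local
# instance; field positions are placeholders for GDELT's actual schema).
import csv
import io
import urllib.request
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def ingest(csv_url, index="gdelt-events"):
    raw = urllib.request.urlopen(csv_url).read().decode("utf-8", "replace")
    for row in csv.reader(io.StringIO(raw), delimiter="\t"):
        # Hypothetical mapping: id, date, and tone columns of the export.
        doc = {"event_id": row[0], "day": row[1], "tone": row[-1]}
        es.index(index=index, document=doc)
```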
This book seeks to promote the exploitation of data science in healthcare systems. The focus is on advancing the automated analytical methods used to extract new knowledge from data for healthcare applications. To do so, the book draws on several interrelated disciplines, including machine learning, big data analytics, statistics, pattern recognition, computer vision, and Semantic Web technologies, and focuses on their direct application to healthcare. Building on three tutorial-like chapters on data science in healthcare, the following eleven chapters highlight success stories on the application of data science in healthcare, where data science and artificial intelligence technologies have proven to be very promising. This book is primarily intended for data scientists involved in the healthcare or medical sector. By reading this book, they will gain essential insights into the modern data science technologies needed to advance innovation for both healthcare businesses and patients. A basic grasp of data science is recommended in order to fully benefit from this book.