Data Warehouses (DWs) with large quantities of data present major performance and scalability challenges, and parallelism can be used for major performance improvement in such a context. However, instead of costly specialized parallel hardware and interconnections, we focus on low-cost standard computing nodes, possibly in a non-dedicated local network. In this environment, special care must be taken with partitioning and processing. We use experimental evidence to analyze the shortcomings of a basic horizontal partitioning strategy designed for that environment, then propose and test improvements that address those shortcomings.
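To make the idea concrete, here is a minimal sketch of hash-based horizontal partitioning over standard nodes, assuming rows keyed by an integer surrogate key; the row format, key column and node count are illustrative assumptions, not the paper's exact strategy.

```python
# Minimal sketch of hash-based horizontal partitioning across standard nodes.
from collections import defaultdict

def hash_partition(rows, key_index, num_nodes):
    """Assign each row to a node by hashing its partitioning key."""
    partitions = defaultdict(list)
    for row in rows:
        node = hash(row[key_index]) % num_nodes
        partitions[node].append(row)
    return partitions

# Example: partition fact rows by the first column across 4 low-cost nodes.
facts = [(1, "2024-01-01", 9.5), (2, "2024-01-02", 3.2), (3, "2024-01-03", 7.1)]
parts = hash_partition(facts, key_index=0, num_nodes=4)
for node, rows in sorted(parts.items()):
    print(f"node {node}: {rows}")
```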
Today deep learning techniques (DL) are the main focus in classification of disease conditions from histology slides, but this task used to be done by more traditional machine learning pipeline algorithms (MLp). The former (DL) can learn autonomously, without any feature engineering. But some questions arise: can we design a fully automated MLp? Can that MLp match DL, at least in some tasks? How should it be designed? Can both be useful and/or complement each other? In this chapter we try to answer those questions. In the process, we design an automated MLp, build DL architectures, apply both to cancer grading, compare accuracy experimentally and discuss the remaining issues. Surprisingly, a carefully designed MLp procedure (acc. 86.5%) compared favorably to deep learning (best acc. 82%) and to humans (acc. 84%) when detecting the degree of atypia for breast cancer prognosis on the limited-size, publicly available Mytos dataset, with the same DL architectures that achieved accuracies of 97% on a different cancer classification task. Most importantly, we discuss the advantages and limitations of the alternatives, in particular which features make DL superior and may justify that choice, but also how MLp can be almost fully automated and produce useful characterization of structures. Finally, we raise challenges, identifying how MLp and DL should evolve to offer explainability and integrate humans in the loop.
Quality of service is a key issue in current and future computer systems. Applications run on systems that typically access a backend DBMS which doesn't provide performance-related QoS guarantees and whose resources are constrained. Throughput increases with the number of concurrent transactions until it reaches a saturation point (optimal EC), beyond which more concurrent transactions lead to a drop in throughput. In order to sustain quality of service, it is essential to use admission control mechanisms that manage congestion and do not admit transactions that cannot be executed within the relevant quality-of-service constraints, thus avoiding performance degradation for running transactions. In this paper we use a QoS brokering architecture to enhance a DBMS with QoS capabilities that allow the system to deliver QoS guarantees while sustaining large throughputs with reduced miss ratios. Experimental results are obtained using the TPC-C transactional benchmark.
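As an illustration of the kind of admission control argued for here (not the paper's actual broker), a minimal sketch follows; the threshold value of 32 and all names are assumptions for the example.

```python
# Minimal admission-control sketch around a measured saturation point.
import threading

OPTIMAL_EC = 32  # assumed concurrency level at the measured throughput peak

class AdmissionController:
    def __init__(self, max_concurrent=OPTIMAL_EC):
        self._slots = threading.Semaphore(max_concurrent)

    def try_admit(self):
        """Admit a transaction only if concurrency stays at or below the peak."""
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

ctl = AdmissionController()
if ctl.try_admit():
    try:
        pass  # execute the transaction against the DBMS here
    finally:
        ctl.release()
else:
    pass  # reject or queue: admitting would push the DBMS past saturation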
Deep Learning (DL) is increasingly used in every medical imaging segmentation task, with great results in terms of illness detection, but precise segmentation of lesion areas and contours in eye fundus images (EFI) is more challenging than the more common tasks of detecting whether an image contains lesions, for referral, or classifying a lesion given a small window. We investigate the difficulties and how the choice of architecture, loss function and data augmentation can improve the quality of precisely segmenting small and difficult-to-detect lesions. Our best IoU scores were, per class: BK: 98%, MA: 16%, HA: 28%, HE: 61%, SE: 51%, OD: 91%. These results and the accompanying discussion also show that further research is still required to improve the results.
The coronavirus responsible for COVID-19 came into our lives like a stampede. Every human activity has been seriously disrupted and millions were confined to their homes. As of March, people in Europe wonder whether the confinement, closures and no-flight policies are effective, or how effective they are, in spite of the earlier positive example of China. In this paper we present our analysis, specifically focused on detecting whether the new-daily-cases curves are on a stabilization route or exploding. This required a set of data processing and analysis steps that we describe in detail. The conclusion is that, as of 22 March, the curves were on a trajectory of stabilization and possible decrease soon. We show why, and also find a most probable correlation with confinement and other government policies.
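One plausible form such a stabilization check can take is sketched below: smooth the daily-cases series and inspect the recent slope. The window sizes, threshold and data are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: classify a daily-cases curve as stabilizing vs. exploding.
import numpy as np

def is_stabilizing(daily_cases, window=7, recent=5, tol=0.02):
    smooth = np.convolve(daily_cases, np.ones(window) / window, mode="valid")
    tail = smooth[-recent:]
    slope = np.polyfit(np.arange(recent), tail, 1)[0]
    return slope <= tol * tail.mean()  # near-flat or falling => stabilizing

cases = [120, 180, 260, 390, 520, 610, 640, 650, 655, 648, 640, 644]
print("stabilizing" if is_stabilizing(cases) else "exploding")
```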
International Journal of Business Process Integration and Management, 2015
Data mining is the process of discovering patterns in large datasets. With the exponential growth of available information, new machine learning, statistics and other analytics techniques have to be developed to meet the processing needs of performing such analysis fast enough to be useful. In this study, techniques such as cluster analysis are applied over generated data in order to perform customer segmentation, and system performance is evaluated by measuring processing time. The data used in this paper is generated using the Star Schema Benchmark (SSB). Our main goal is to find a scalable solution to run data mining over a decision support benchmark. Four different systems are tested: single-node MySQL, MySQL Cluster, Apache Mahout and R. By running MySQL Cluster and Mahout, each distributed over four nodes, the paper compares the performance of k-means run in parallel. MySQL and R allow comparison of this kind of execution against methods running on a single machine, on both relational and non-relational systems.
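For concreteness, a single-machine version of the measured task might look like the sketch below (scikit-learn stands in for the systems actually compared in the study; the data shape and k are illustrative).

```python
# Minimal k-means-with-timing sketch for customer segmentation.
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
customers = rng.normal(size=(100_000, 8))  # synthetic customer feature rows

start = time.perf_counter()
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(customers)
elapsed = time.perf_counter() - start

print(f"k-means over {customers.shape[0]} rows took {elapsed:.2f}s")
print("segment sizes:", np.bincount(labels))
```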
This visionary track theme focuses on dependability and security of software and services. Service-based systems are being used in business and safety-critical environments to achieve operational goals, and they possess special characteristics that bring difficult challenges to the research and industry communities. Among these challenges, dependability and security have been widely identified as critical aspects that need to be addressed, especially considering that services are deployed on the web and used over unreliable networks to perform critical functions.
Prediction of consumption has several applications in industry, including support for strategic decisions, market offerings, and value propositions. In the telecommunications industry, it can also be used in network resource management and in guaranteeing quality of service to users. But in order to make good predictions, one should choose the algorithm that best fits the considered time series and also configure its parameters correctly. In this chapter, we discuss the use of time series forecasting algorithms over telecommunications data. We evaluate the use of Auto-Regressive Integrated Moving Average (ARIMA), Prophet (launched by Facebook in 2017), and two neural network algorithms: Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM). We ran those algorithms over real data about Internet data consumption and mobile phone card recharges, in order to forecast time periods of distinct sizes. Forecasts were evaluated in terms of Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). The results show that ARIMA is the algorithm best suited to most cases.
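A minimal version of this evaluation loop, assuming a synthetic consumption series and an arbitrary (p, d, q) order rather than the chapter's tuned configurations, could be sketched as follows.

```python
# Hedged ARIMA forecasting sketch with RMSE/MAPE scoring.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
consumption = 100 + np.cumsum(rng.normal(0.5, 2.0, size=200))  # synthetic GB/day
train, test = consumption[:-14], consumption[-14:]

model = ARIMA(train, order=(2, 1, 2)).fit()  # order is an illustrative guess
forecast = model.forecast(steps=len(test))

rmse = np.sqrt(np.mean((forecast - test) ** 2))
mape = np.mean(np.abs((forecast - test) / test)) * 100
print(f"RMSE={rmse:.2f}  MAPE={mape:.2f}%")
```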
Big data platforms strive to achieve scalability and realtime for query processing and complex analytics over “big” and/or “fast” data. In this context, big data warehouses are huge repositories of data to be used in analytics and machine learning. This work discusses models, concepts and approaches to reach scalability and realtime in big data processing and big data warehouses. The main concepts of NoSQL, Parallel Data Management Systems (PDBMS), MapReduce and Spark are reviewed in the context of scalability: the first two offer data management, the last two add flexible and scalable processing capacity. We also turn our attention to realtime data processing, the lambda architecture and its relation to scalability, and we revisit our own recent research on the issue. Three approaches directly related to realtime and scalability are included: the use of a realtime component in a data warehouse, parallelized de-normalization for scalability, and execution tree sharing to scale to simultaneous sessions. With these models and technologies we revisit some of the major current solutions for data management and data processing with scalability and realtime capacities.
2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), 2018
Diabetic Retinopathy (DR) is diagnosed based on the analysis of characteristic lesions in Eye Fundus Images (EFI). Early stages are characterized mostly by the existence of micro-aneurysms. As the disease progresses, hemorrhages and exudates become apparent. In later proliferative DR, neo-vascularization and other anomalies complete the set of symptoms. Automated classification can be applied to identify lesions, but highly effective features are required. In this paper we propose an approach to create and elect "best features" and "best classification". The concepts of contrast-relative features and extensive multi-type feature sets are proposed, and a three-step approach to arrive at the best features and accurate classification is described. Experimental results validate the approach, showing that contrast-relative features are especially well suited, and identifying a disparate set of most accurate features and the best classifiers to use.
The 7th International Conference on Time Series and Forecasting, 2021
Epidemiological mathematics resorts to Susceptible-Infected-Recovered (SIR)-like models to describe contagion evolution curves for diseases such as Covid-19. Other time series estimation approaches can also be used to fit and forecast the curves. We use data from the Covid-19 pandemic infection curves of 20 countries to compare forecasting using SEIR (a variant of SIR), polynomial regression, ARIMA and Prophet. Degree-2 polynomial regression (POLY d(2)) on differentiated curves had the lowest 15-day forecast errors (6% average error over the 20 countries), while SEIR (errors 25–68%) and ARIMA (errors 15–85%) were better for spans larger than 30 days. We highlight the importance of SEIR for longer terms and of POLY d(2) for 15-day forecasting.
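The POLY d(2)-on-differentiated-curves idea can be sketched as below, under the assumption that the fit is on daily (differenced) counts and the 15-day forecast is re-integrated; the data here are synthetic.

```python
# Hedged sketch of degree-2 polynomial forecasting on a differentiated curve.
import numpy as np

cumulative = np.array([2.0 * t**2 + 30 * t for t in range(60)])  # toy curve
daily = np.diff(cumulative)  # differentiated (daily new cases) curve

t = np.arange(len(daily))
coeffs = np.polyfit(t, daily, deg=2)  # POLY d(2) fit

t_future = np.arange(len(daily), len(daily) + 15)  # 15-day horizon
daily_forecast = np.polyval(coeffs, t_future)

# Re-integrate daily forecasts back into the cumulative curve.
cum_forecast = cumulative[-1] + np.cumsum(daily_forecast)
print(cum_forecast.round(1))
```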
Thirteenth International Conference on Digital Image Processing (ICDIP 2021), 2021
Deep learning is increasingly used in every medical imaging segmentation task, but detection of lesions in eye fundus images (EFI) poses many difficult challenges related to small sizes, similarity to other lesions and structures, low contrast, and varying conformations. During training, the loss function directs backpropagation learning in the deep convolutional neural networks (DCNN) that are used; it is therefore fundamental to the optimization procedure. Alternative formulations exist, such as cross-entropy, Jaccard and Dice. But does the choice of loss influence quality decisively in the difficult context of EFI lesions? And what about the network architecture? As part of our effort to improve the approaches, we evaluate alternative loss functions as well as alternative architectures. We show that the choice of a convenient architecture and loss function can double the quality of detection for some of the small and difficult-to-detect lesions, but we also show that research is still required to find ways to improve the results further.
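To show what a Jaccard-style loss computes, here is a forward-pass-only sketch in NumPy; an actual training setup would use an autograd implementation, so treat this as illustrative rather than the paper's code.

```python
# Minimal soft-Jaccard (IoU) loss sketch: 1 - intersection / union.
import numpy as np

def soft_jaccard_loss(pred, target, eps=1e-6):
    """pred: predicted probabilities in [0, 1]; target: binary mask. Both HxW."""
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return 1.0 - (intersection + eps) / (union + eps)

pred = np.array([[0.9, 0.2], [0.1, 0.8]])
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
print(f"soft Jaccard loss: {soft_jaccard_loss(pred, mask):.3f}")
```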
Deep Learning outperforms the prior art in medical imaging tasks. It has been applied to segmentation of Magnetic Resonance Imaging (MRI) scans, where consecutive slices capture body structures relevant for visualization and diagnosis of medical conditions. In this work we investigate experimentally the factors that improve segmentation performance on MRI sequences of abdominal organs, including network architecture, pre-training, data augmentation and improvements to the loss function. After comparing segmentation network architectures, we choose the best-performing one and experiment with improvements (data augmentation, training choices). Finally, since metrics, and the IoU of each organ in particular, are fundamental, we change the loss function to an IoU-based one and evaluate the resulting quality. We show that DeepLabV3 is better than its competitors by 20 percentage points (pp) or more (depending on the competitor), and that data augmentation and further enhancements improve the performance of DeepLabV3 by 12 percentage points.
Automated analysis of histological images helps diagnose and further classify breast cancer. Totally automated approaches can be used to pinpoint images for further analysis by the medical doctor. But tissue images are especially challenging for both manual and automated approaches, due to mixed patterns and textures, where malignant regions are sometimes difficult to detect unless they are at very advanced stages. Some of the major challenges are related to irregular and very diffuse patterns, as well as the difficulty of defining winning features and classifier models. Although the diffuse nature also makes it hard to segment correctly into regions, it is still crucial to take low-level features over individualized regions instead of the whole image, and to select those with the best outcomes. In this paper we report on our experiments building a region classifier with a simple subspace division and a feature selection model that improves results over image-wide and/or limited feature sets. Experimental results show modest accuracy for a set of classifiers applied over the whole image, while the conjunction of image division, per-region low-level feature extraction and feature selection, together with the use of a neural network classifier, achieved the best levels of accuracy for the dataset and settings used in our experiments. Future work involves deep learning techniques, adding structure semantics and embedding the approach as a tumor-finding helper in a practical medical imaging application.
Segmentation of lesions in eye fundus images (EFI) is a difficult problem, due to small sizes, varying morphologies, similarities and lack of contrast. Today, deep learning segmentation architectures are state-of-the-art in most segmentation tasks. But metrics need to be interpreted adequately to avoid wrong conclusions: e.g., we show that a 90% global accuracy for the Fully Convolutional Network (FCN) does not mean it segments lesions very well. In this work we test and compare deep segmentation networks applied to finding lesions in EFI, focusing on the comparison and on how metrics should really be interpreted, to avoid mistakes and understand why. In the light of this analysis, we conclude by discussing further challenges that lie ahead.
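The accuracy-vs-IoU point can be reproduced with a tiny synthetic example: a predictor that outputs only background scores near-perfect global accuracy while its lesion IoU is zero. The sketch below is illustrative, not the paper's evaluation code.

```python
# Per-class IoU vs. global accuracy on a synthetic label map.
import numpy as np

def per_class_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        ious.append(inter / union if union else float("nan"))
    return ious

target = np.zeros((100, 100), dtype=int)
target[48:52, 48:52] = 1            # a tiny lesion, 16 pixels
pred = np.zeros_like(target)        # a network predicting only background

acc = np.mean(pred == target)       # 99.8% global accuracy...
print(f"global accuracy: {acc:.3f}, IoU per class: {per_class_iou(pred, target, 2)}")
# ...yet lesion IoU is 0.0: the lesion was never found.
```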
International Conference on Software Engineering Advances, Nov 15, 2015
In this paper, we investigate the problem of providing scalability to the data Extraction, Transformation, Load and Querying (ETL+Q) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically. Parallel architectures and mechanisms are able to optimize the ETL process by speeding up each part of the pipeline as more performance is needed. We propose an approach to enable automatic scalability and freshness of any data warehouse and ETL+Q process, suitable for both smallData and bigData businesses. A general framework for testing and implementing the system was developed to provide solutions for each part of the ETL+Q automatic scalability. The results show that the proposed system is capable of scaling to provide the desired processing speed for both near-real-time results and offline ETL+Q processing.
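One way to picture scaling a single pipeline stage is the sketch below, which fans a toy transform step out over a worker pool; the stage logic and worker count are made-up assumptions, not the framework itself.

```python
# Hedged sketch: parallelize the "T" stage of an ETL pipeline with workers.
from multiprocessing import Pool

def transform(row):
    """A toy transformation: normalize a (customer, amount) record."""
    customer, amount = row
    return (customer.strip().upper(), round(float(amount), 2))

if __name__ == "__main__":
    extracted = [(" alice ", "10.456"), ("bob", "3.1"), (" carol ", "7.899")]
    with Pool(processes=4) as pool:          # add workers as load grows
        loaded = pool.map(transform, extracted)
    print(loaded)  # rows ready for the load step
```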
Computer Vision and Image Understanding, Feb 1, 2020
Communications in Computer and Information Science, 2018
Over the past decades, several new concepts have emerged to organize and query data over large Data Warehouse (DW) systems, all with the same primary objective: optimizing processing speed. More recently, with the rise of the BigData concept, storage cost has lowered significantly and performance (random accesses) has increased, particularly with modern SSD disks. This paper introduces and tests a storage alternative that goes against current data normalization premises, in a setting where storage space is no longer a concern. By de-normalizing the entire data schema (transparently to the user), we propose a new system concept, called SINGLE, where query execution time is entirely predictable, independently of query complexity. The proposed data model also allows easy partitioning and distributed processing to enable execution parallelism, boosting performance, as happens in MapReduce. The TPC-H benchmark is used to evaluate storage space and query performance. Results show predictable performance when compared with approaches based on a normalized relational schema and with MapReduce-oriented approaches.
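The core de-normalization idea can be illustrated as follows: pre-join all dimensions into one wide table so every query becomes a single predictable scan. The toy schema and pandas code below are illustrative assumptions, not the SINGLE system.

```python
# Hedged sketch of full schema de-normalization into one wide table.
import pandas as pd

sales = pd.DataFrame({"cust_id": [1, 2, 1], "prod_id": [10, 10, 20], "qty": [3, 1, 5]})
customers = pd.DataFrame({"cust_id": [1, 2], "region": ["EU", "US"]})
products = pd.DataFrame({"prod_id": [10, 20], "category": ["A", "B"]})

# De-normalize once, ahead of query time (transparent to the user).
single = sales.merge(customers, on="cust_id").merge(products, on="prod_id")

# Any query is now a join-free scan over one table.
print(single.groupby(["region", "category"])["qty"].sum())
```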