Papers by Cesare Furlanello
arXiv (Cornell University), Jan 13, 2012
Due to the ever-rising importance of the network paradigm across several areas of science, comparing and classifying graphs are essential steps in the network analysis of complex systems. Both tasks have recently been tackled via quite different strategies, often tailored ad hoc to the problem under investigation. Here we address both operations by introducing the Hamming-Ipsen-Mikhailov (HIM) distance, a novel metric that quantitatively measures the difference between two graphs sharing the same vertices. The new measure combines the local Hamming distance and the global spectral Ipsen-Mikhailov distance so as to overcome the drawbacks affecting each component separately. Building the HIM kernel function from the HIM distance then makes it possible to move from network comparison to network classification via the Support Vector Machine (SVM) algorithm. Applications of the HIM distance and the HIM kernel in computational biology and social network science demonstrate the effectiveness of the proposed functions as a general-purpose solution.
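The combination of a local and a global term described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the spectral term here is a simplified Euclidean distance between sorted Laplacian spectra, standing in for the full Ipsen-Mikhailov spectral-density distance, and the mixing parameter `xi` is assumed.

```python
import numpy as np

def hamming_distance(A, B):
    # Local term: normalized Hamming distance between adjacency matrices
    # on the same vertex set (fraction of differing off-diagonal entries).
    n = A.shape[0]
    return np.abs(A - B).sum() / (n * (n - 1))

def spectral_distance(A, B):
    # Global term (simplified stand-in for the Ipsen-Mikhailov distance):
    # Euclidean distance between the sorted Laplacian spectra.
    def lap_spec(M):
        L = np.diag(M.sum(axis=1)) - M
        return np.sort(np.linalg.eigvalsh(L))
    return np.linalg.norm(lap_spec(A) - lap_spec(B))

def him_distance(A, B, xi=1.0):
    # Combine the two terms, normalizing so identical graphs give 0.
    h = hamming_distance(A, B)
    im = spectral_distance(A, B)
    return np.sqrt(h**2 + xi * im**2) / np.sqrt(1 + xi)

# Two graphs on 3 shared vertices: a path vs. a triangle.
P = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
T = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
print(hamming_distance(P, T))  # they differ in one edge: 2/6 entries -> 0.333...
```

A kernel for SVM classification can then be derived from any such distance, e.g. as `exp(-d**2 / sigma)`, in the spirit of the HIM kernel.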
arXiv (Cornell University), Dec 5, 2013
The functional and structural representation of the brain as a complex network is marked by the fact that comparing noisy, intrinsically correlated, high-dimensional structures between experimental conditions or groups defies typical mass-univariate methods. Furthermore, most network estimation methods cannot distinguish between real and spurious correlations arising from the convolution due to nodes' interactions, which introduces additional noise into the data. We propose a machine learning pipeline aimed at identifying multivariate differences between brain networks associated with different experimental conditions. The pipeline (1) leverages the deconvolved individual contribution of each edge and (2) maps the task into a sparse classification problem in order to construct the associated "sparse deconvolved predictive network", i.e. a graph with the same nodes as those compared but whose edge weights are defined by their relevance for out-of-sample predictions in classification. We present an application of the proposed method by decoding the covert attention direction (left or right) from the single-trial functional connectivity matrix extracted from high-frequency magnetoencephalography (MEG) data. Our results demonstrate how network deconvolution matched with sparse classification methods outperforms typical approaches for MEG decoding.
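The sparse-classification step can be sketched as below: connectivity matrices are vectorized into edge features and an L1-penalized classifier selects the edges relevant for prediction. The data, dimensions, and regularization strength are hypothetical; this is not the paper's pipeline (which also includes the deconvolution step).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 40 trials, each a 10x10 symmetric connectivity matrix,
# vectorized to its upper-triangle edges as features.
rng = np.random.default_rng(0)
n_trials, n_nodes = 40, 10
iu = np.triu_indices(n_nodes, k=1)
X = rng.normal(size=(n_trials, len(iu[0])))   # one edge weight per column
y = rng.integers(0, 2, size=n_trials)          # covert attention: left/right

# The L1 penalty drives most edge coefficients to exactly zero; the
# surviving nonzero coefficients define the edges of the sparse
# predictive network.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])
print(len(selected), "edges retained out of", X.shape[1])
```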
arXiv (Cornell University), Sep 6, 2011
Within the preprocessing pipeline of a Next Generation Sequencing sample, the set of single-base mismatches is one of the first outcomes, together with the number of correctly aligned reads. The union of these two sets provides a 4 × 4 matrix (called the Single Base Indicator, SBI, in what follows) representing a blueprint of the sample and its preprocessing ingredients, such as the sequencer, the alignment software and the pipeline parameters. In this note we show that, under the same technological conditions, there is a strong relation between the SBI and the biological nature of the sample. To reach this goal we introduce a similarity measure between SBIs; we also show how two measures commonly used in machine learning can help in this context.
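One plausible similarity between two such 4 × 4 matrices is sketched below. The cosine measure and the toy counts are illustrative assumptions, not the specific measures evaluated in the note.

```python
import numpy as np

def sbi_cosine_similarity(S1, S2):
    # Cosine similarity between two 4x4 Single Base Indicator matrices,
    # each first normalized to proportions so that sequencing depth
    # does not dominate the comparison.
    v1 = (S1 / S1.sum()).ravel()
    v2 = (S2 / S2.sum()).ravel()
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Toy SBIs: counts of reference base -> observed base (A/C/G/T rows/cols).
S_a = np.array([[90, 3, 4, 3],
                [2, 91, 3, 4],
                [4, 3, 90, 3],
                [3, 4, 3, 90]], dtype=float)
S_b = S_a * 2.5   # same mismatch profile, 2.5x the sequencing depth
print(sbi_cosine_similarity(S_a, S_b))  # scale-invariant -> 1.0
```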
arXiv (Cornell University), Sep 14, 2019
Bioinformatics of high-throughput omics data (e.g. microarrays and proteomics) was plagued by countless reproducibility issues at the start of the century. These concerns motivated international initiatives such as the FDA-led MAQC Consortium, which addressed the reproducibility of predictive biomarkers by means of appropriate Data Analysis Plans (DAPs). For instance, repeated cross-validation is a standard procedure meant to mitigate the risk that information from held-out validation data is used during model selection. We prove here that, many years later, data leakage can still be a non-negligible source of overfitting in deep learning models for digital pathology. In particular, we evaluate the impact of (i) the presence of multiple images for each subject in histology collections, and (ii) the systematic adoption of training over collections of subregions (i.e. "tiles" or "patches") extracted from the same subject. We verify that accuracy scores may be inflated by up to 41%, even if a well-designed 10 × 5 iterated cross-validation DAP is applied, unless all images from the same subject are kept together in either the internal training or validation splits. Results are replicated for 4 classification tasks in digital pathology on 3 datasets, for a total of 373 subjects and 543 slides (around 27,000 tiles). The impact of applying transfer learning strategies with models pre-trained on general-purpose or digital pathology datasets is also discussed. Preprint. Under review.
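The "keep all images from the same subject together" constraint corresponds to grouped cross-validation. A minimal sketch with scikit-learn's GroupKFold, on hypothetical toy data (subject IDs and tile counts are assumptions):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 12 tiles drawn from 4 subjects (3 tiles each). Grouping by
# subject guarantees no subject's tiles are split across the training and
# validation sides of a fold -- the leakage source discussed above.
X = np.arange(24).reshape(12, 2)                     # dummy tile features
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])  # dummy labels
subjects = np.repeat(["s1", "s2", "s3", "s4"], 3)    # subject of each tile

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=subjects):
    train_subj = set(subjects[train_idx])
    val_subj = set(subjects[val_idx])
    assert train_subj.isdisjoint(val_subj)  # no subject leaks across the split
    print("validation subject(s):", sorted(val_subj))
```

Naive KFold over tiles, by contrast, would routinely place tiles of the same subject on both sides of the split, inflating the accuracy estimate.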
arXiv (Cornell University), Aug 21, 2012
We introduce a novel ANSI C implementation of the MINE family of algorithms for computing maximal information-based measures of dependence between two variables in large datasets, with the aim of a low memory footprint and ease of integration within bioinformatics pipelines. We provide the libraries minerva (with an R interface) and minepy for Python, MATLAB, Octave and C++. The C solution reduces the large memory requirement of the original Java implementation, has good upscaling properties, and offers native parallelization for the R interface. Low memory requirements are demonstrated on the MINE benchmarks as well as on large (n=1340) microarray and Illumina GAII RNA-seq transcriptomics datasets.
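The flavor of an information-based dependence measure can be conveyed with a crude histogram estimate of mutual information. This is a conceptual stand-in only, NOT the MINE/MIC algorithm (which searches over many grid resolutions and normalizes by grid size); the bin count is an arbitrary assumption.

```python
import numpy as np

def grid_mutual_information(x, y, bins=8):
    # Histogram-based mutual information estimate in bits: bin the joint
    # distribution on a fixed grid and sum p * log2(p / (px * py)).
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 2000)
print(grid_mutual_information(x, x**2))                      # strong (nonlinear) dependence
print(grid_mutual_information(x, rng.uniform(-1, 1, 2000)))  # near zero
```

A key point of MIC, which this sketch does not capture, is that it detects such nonlinear associations while scoring comparably across very different functional forms.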
arXiv (Cornell University), May 1, 2010
Many functions have recently been defined to assess the similarity between networks as tools for quantitative comparison. They stem from very different frameworks, and they are tuned to deal with different situations. Here we give an overview of the spectral distances, highlighting their behavior in some basic cases of static and dynamic synthetic and real networks.
bioRxiv (Cold Spring Harbor Laboratory), Jun 6, 2018
Artificial Intelligence is exponentially increasing its impact on healthcare. As deep learning masters computer vision tasks, its application to digital pathology is natural, with the promise of aiding routine reporting and standardizing results across trials. Deep learning features inferred from digital pathology scans can improve the validity and robustness of current clinico-pathological features, up to identifying novel histological patterns, e.g. from tumor-infiltrating lymphocytes. In this study, we examine the issue of evaluating the accuracy of predictive models built on deep learning features in digital pathology, as a hallmark of reproducibility. We introduce the DAPPER framework for validation, based on a rigorous Data Analysis Plan derived from the FDA's MAQC project and designed to analyse causes of variability in predictive biomarkers. We apply the framework to models that identify tissue of origin on 787 Whole Slide Images from the Genotype-Tissue Expression (GTEx) project. We test 3 different deep learning architectures (VGG, ResNet, Inception) as feature extractors and three classifiers (a fully connected multilayer network, a Support Vector Machine and Random Forests), and work with 4 datasets (5, 10, 20 or 30 classes), for a total of 53,000 tiles at 512 × 512 resolution. We analyze the accuracy and feature stability of the machine learning classifiers, also demonstrating the need for random-features and random-labels diagnostic tests to identify selection bias and risks for reproducibility. Further, we use the deep features from the VGG model trained on GTEx on the KIMIA24 dataset for identification of slide of origin (24 classes), training a classifier on 1060 annotated tiles and validating it on 265 unseen ones.
The DAPPER software, including its deep learning backbone pipeline and the HINT (Histological Imaging-Newsy Tiles) benchmark dataset derived from GTEx, is released as a basis for standardization and validation initiatives in AI for Digital Pathology.
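The random-labels diagnostic mentioned above can be sketched as follows: retrain the model on shuffled labels and check that cross-validated accuracy collapses to chance. The data, model, and dimensions here are hypothetical, not DAPPER itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random-labels diagnostic: a sound Data Analysis Plan should score near
# chance when labels are shuffled; anything much higher signals selection
# bias or leakage somewhere in the pipeline.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))        # hypothetical deep-feature matrix
y = rng.integers(0, 2, size=120)      # hypothetical binary labels

y_shuffled = rng.permutation(y)
score = cross_val_score(RandomForestClassifier(random_state=0),
                        X, y_shuffled, cv=5).mean()
print(f"random-label CV accuracy: {score:.2f}  (chance is ~0.50)")
```

A companion random-features test (shuffling or regenerating X while keeping y) probes the same failure mode from the feature side.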
arXiv (Cornell University), Oct 16, 2017
Convolutional Neural Networks (CNNs) are a popular deep learning architecture widely applied in different domains, in particular to image classification, for which the concept of convolution with a filter comes naturally. Unfortunately, the requirement of a distance (or, at least, of a neighbourhood function) in the input feature space has so far prevented their direct use on data types such as omics data. However, a number of omics data are metrizable, i.e., they can be endowed with a metric structure, enabling the adoption of a convolution-based deep learning framework, e.g., for prediction. We propose a generalized solution for CNNs on omics data, implemented through a dedicated Keras layer. In particular, for metagenomics data, a metric can be derived from the patristic distance on the phylogenetic tree. For transcriptomics data, we combine Gene Ontology semantic similarity and gene co-expression to define a distance; the function is defined through a multilayer network where 3 layers are defined by GO mutual semantic similarity and the fourth by gene co-expression. As a general tool, feature distance on omics data is enabled by OmicsConv, a novel Keras layer, yielding OmicsCNN, a dedicated deep learning framework. Here we demonstrate OmicsCNN on gut microbiota sequencing data, for Inflammatory Bowel Disease (IBD) 16S data, first on synthetic data and then on a metagenomics collection of the gut microbiota of 222 IBD patients.
arXiv (Cornell University), May 26, 2017
In several physical systems, important properties characterizing the system itself are theoretically related to specific degrees of freedom. Although standard Monte Carlo simulations provide an effective tool to accurately reconstruct the physical configurations of the system, they are unable to isolate the different contributions corresponding to different degrees of freedom. Here we show that unsupervised deep learning can become a valid support to MC simulation, coupling useful insight into the phase-detection task with good reconstruction performance. As a testbed we consider the 2D XY model, showing that a deep neural network based on variational autoencoders can detect the continuous Kosterlitz-Thouless (KT) transitions and that, if endowed with the appropriate constraints, it generates configurations with meaningful physical content.
In this paper we introduce an individual-based model for simulating the spread of an emerging influenza pandemic in Italy and for testing the effectiveness of some containment strategies, including vaccination, antiviral prophylaxis and quarantine measures to increase social distance. Our results show that while the probability of interrupting a large outbreak is negligible, a combination of the control measures can
Current Drug Targets, Sep 18, 2020
Sensors, Nov 27, 2020
A key access point to the functioning of the autonomic nervous system is the investigation of peripheral signals. Wearable devices (WDs) enable the acquisition and quantification of peripheral signals in a wide range of contexts, from personal use to scientific research. WDs have lower costs and higher portability than medical-grade devices. However, the achievable data quality can be lower, and data are subject to artifacts due to body movements and data losses. It is therefore crucial to evaluate the reliability and validity of WDs before their use in research. In this study, we introduce a data analysis procedure for the assessment of WDs for multivariate physiological signals. The quality of cardiac and electrodermal activity signals is validated with a standard set of signal quality indicators. The pipeline is available as a collection of open-source Python scripts based on the pyphysio package. We apply the indicators to the analysis of signal quality on data recorded simultaneously from a clinical-grade device and two WDs. The dataset provides signals of six different physiological measures collected from 18 subjects with WDs. This study indicates the need to validate the use of WDs in experimental settings for research and the importance of both technological and signal-processing aspects in obtaining reliable signals and reproducible results.
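A minimal example of a signal quality indicator is sketched below: the fraction of samples falling in a plausible physiological range. The thresholds and toy data are assumptions; real pipelines such as the one described combine several indicators.

```python
import numpy as np

def quality_ratio(signal, lo, hi):
    # Minimal signal-quality indicator: fraction of samples within a
    # plausible physiological range. Samples outside [lo, hi] are treated
    # as artifacts (e.g. from body movement or data loss).
    signal = np.asarray(signal, dtype=float)
    ok = (signal >= lo) & (signal <= hi)
    return float(ok.mean())

# Toy inter-beat-interval series (seconds) with two artifactual samples.
ibi = np.array([0.80, 0.82, 0.79, 2.50, 0.81, 0.05, 0.80, 0.83])
print(quality_ratio(ibi, lo=0.3, hi=1.5))  # 6 of 8 samples in range -> 0.75
```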
Information Fusion, Mar 1, 2003
arXiv (Cornell University), Feb 29, 2012
mlpy is an open-source Python machine learning library built on top of NumPy/SciPy and the GNU Scientific Library. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems, and it aims at a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, works with Python 2 and 3, and is distributed under GPL3 at http://mlpy.fbk.eu.
BMC Bioinformatics, Mar 1, 2018
Background: Convolutional Neural Networks can be effectively used only when data are endowed with an intrinsic concept of neighbourhood in the input space, as is the case for pixels in images. We introduce here Ph-CNN, a novel deep learning architecture for the classification of metagenomics data based on Convolutional Neural Networks, with the patristic distance defined on the phylogenetic tree used as the proximity measure. The patristic distance between variables is used together with a sparsified version of MultiDimensional Scaling to embed the phylogenetic tree in a Euclidean space. Results: Ph-CNN is tested with a domain adaptation approach on synthetic data and on a metagenomics collection of the gut microbiota of 38 healthy subjects and 222 Inflammatory Bowel Disease patients, divided into 6 subclasses. Classification performance is promising when compared to classical algorithms such as Support Vector Machines and Random Forest and to a baseline fully connected neural network, i.e. the Multi-Layer Perceptron. Conclusion: Ph-CNN represents a novel deep learning approach for the classification of metagenomics data. Operatively, the algorithm has been implemented as a custom Keras layer that passes to the following convolutional layer not only the data but also the ranked neighbourhood list of each sample, thus mimicking the case of image data, transparently to the user.
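The embedding step can be sketched with plain MultiDimensional Scaling: a distance matrix between taxa is mapped to Euclidean coordinates, so that neighbourhoods on the phylogenetic tree become spatial neighbourhoods a convolution can exploit. The 4-taxa distance matrix is hypothetical, and this uses standard MDS rather than the sparsified variant described above.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical patristic-distance matrix between 4 taxa: taxa 0 and 1 are
# close on the tree, as are taxa 2 and 3, while the two pairs are distant.
D = np.array([[0.0, 0.3, 0.9, 1.0],
              [0.3, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.2],
              [1.0, 0.9, 0.2, 0.0]])

# Embed the precomputed distances into 2-D Euclidean coordinates.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

# Tree neighbours should land closer together than distant taxa.
d01 = np.linalg.norm(coords[0] - coords[1])
d03 = np.linalg.norm(coords[0] - coords[3])
print(d01 < d03)
```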
Social Science Research Network, 2022
The analysis of CAGE (Cap Analysis of Gene Expression) time-courses has been proposed by the FANTOM5 Consortium to extend the understanding of the sequence of events facilitating cell-state transitions at the level of promoter regulation. To identify the most prominent transcriptional regulations induced by growth factors in human breast cancer, we apply here the Complexity-Invariant Dynamic Time Warping motif EnRichment (CIDER) analysis approach to the CAGE time-course datasets of MCF-7 cells stimulated by epidermal growth factor (EGF) or heregulin (HRG). We identify a multi-level cascade of regulations rooted in the Serum Response Factor (SRF) transcription factor, connecting the MAPK-mediated transduction of the HRG stimulus to the negative regulation of the MAPK pathway by members of the DUSP family of phosphatases. The finding confirms the known primary role of FOS and FOSL1, members of the AP-1 family, in shaping gene expression in response to HRG induction. Moreover, we identify a new potential regulation of DUSP5 and RARA (known to antagonize the transcriptional regulation induced by the estrogen receptors) by the activity of the AP-1 complex, specific to the HRG response. The results indicate that a divergence in AP-1 regulation determines cellular changes in breast cancer cells stimulated by ErbB receptors.
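The alignment measure underlying this kind of motif matching is dynamic time warping, sketched below for 1-D series. This is the classic DTW recursion only; CIDER additionally applies a complexity-invariant correction, which is not shown here.

```python
import numpy as np

def dtw_distance(a, b):
    # Classic dynamic-time-warping distance between two 1-D series:
    # D[i, j] holds the cheapest alignment cost of a[:i] with b[:j].
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])

pulse = [0, 1, 3, 1, 0]
shifted = [0, 0, 1, 3, 1, 0]   # same expression motif, delayed one step
print(dtw_distance(pulse, shifted))  # 0.0: warping absorbs the time shift
```

This time-shift invariance is what lets motif enrichment group promoters whose responses share a shape but not a latency.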