Papers by Sanghamitra Bandyopadhyay

Communications Biology
A fundamental problem of downstream analysis of scRNA-seq data is the unavailability of enough cell samples compared to the feature size. This is mostly due to the budgetary constraints of single cell experiments or simply because of the small number of available patient samples. Here, we present an improved version of generative adversarial network (GAN) called LSH-GAN to address this issue by producing new realistic cell samples. We update the training procedure of the generator of GAN using locality sensitive hashing, which speeds up the sample generation, thus maintaining the feasibility of applying the standard procedures of downstream analysis. LSH-GAN outperforms the benchmarks for realistic generation of quality cell samples. Experimental results show that generated samples of LSH-GAN improve the performance of downstream analysis such as feature (gene) selection and cell clustering. Overall, LSH-GAN therefore addressed the key challenges of small sample scRNA-seq data analy...

Briefings in Bioinformatics, 2021
Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Since single-cell data are susceptible to technical noise, the quality of genes selected prior to clustering is of crucial importance in the preliminary steps of downstream analysis. Therefore, robust gene selection has gained considerable attention in recent years. We introduce sc-REnF [robust entropy based feature (gene) selection method], aiming to leverage the advantages of Rényi and Tsallis entropies in gene selection for single cell clustering. Experiments demonstrate that with a tuned parameter (q), Rényi and Tsallis entropies select genes that improve the clustering results significantly over the other competing methods. sc-REnF can capture relevancy and redundancy among the features of noisy data extremely well due to its robust objective function. Moreover, the selected features/genes are able to determine the unknown cells with a hig...
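
For orientation, the two entropies named in this abstract can be sketched in a few lines of Python. This uses the standard textbook definitions for a discrete distribution with parameter q (q ≠ 1); it is not the sc-REnF objective function, which the truncated abstract does not spell out, and the function names are illustrative.

```python
import math

def renyi_entropy(p, q):
    """Renyi entropy of a discrete distribution p for order q (q > 0, q != 1)."""
    if q <= 0 or q == 1:
        raise ValueError("q must be positive and different from 1")
    # H_q(p) = log(sum_i p_i^q) / (1 - q); zero-probability terms contribute nothing.
    return math.log(sum(x ** q for x in p if x > 0)) / (1 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy of a discrete distribution p for index q (q != 1)."""
    if q == 1:
        raise ValueError("q must be different from 1")
    # S_q(p) = (1 - sum_i p_i^q) / (q - 1)
    return (1 - sum(x ** q for x in p if x > 0)) / (q - 1)

# Hypothetical example: a uniform distribution over four expression bins.
uniform = [0.25, 0.25, 0.25, 0.25]
print(renyi_entropy(uniform, 2.0))
print(tsallis_entropy(uniform, 2.0))
```

Both quantities reduce to the Shannon entropy in the limit q → 1, which is why q acts as a tunable parameter.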

arXiv: Molecular Networks, 2020
Coronavirus Disease 2019 (COVID-19) has been creating a worldwide pandemic situation. Repurposing drugs, already shown to be free of harmful side effects, for the treatment of COVID-19 patients is an important option in launching novel therapeutic strategies. Therefore, reliable molecule interaction data are a crucial basis, where drug-/protein-protein interaction networks establish invaluable, year-long carefully curated data resources. However, these resources have not yet been systematically exploited using high-performance artificial intelligence approaches. Here, we combine three networks, two of which are year-long curated, and one of which, on SARS-CoV-2-human host-virus protein interactions, was published only very recently (30th of April 2020), yielding a novel network that puts drugs, human and virus proteins into mutual context. We apply Variational Graph AutoEncoders (VGAEs), representing the most advanced deep learning based methodology for the analysis of data that are subj...

High dimensional, small sample size (HDSS) scRNA-seq data presents a challenge to the gene selection task in single cell. Conventional gene selection techniques are unstable and less reliable due to the small number of available samples, which affects cell clustering and annotation. Here, we present an improved version of generative adversarial network (GAN) called LSH-GAN to address this issue by producing new realistic samples and combining these with the original scRNA-seq data. We update the training procedure of the generator of GAN using locality sensitive hashing, which speeds up the sample generation, thus maintaining the feasibility of applying gene selection procedures in high-dimensional scRNA-seq data. Experimental results show a significant improvement in the performance of benchmark feature (gene) selection techniques on generated samples of one synthetic and four HDSS scRNA-seq datasets. Comprehensive simulation study ensures the applicability of the model in the feature (gene) ...

Genome Research, 2021
Systematic delineation of complex biological systems is an ever-challenging and resource-intensive process. Single-cell transcriptomics allows us to study cell-to-cell variability in complex tissues at an unprecedented resolution. Accurate modeling of gene expression plays a critical role in the statistical determination of tissue-specific gene expression patterns. In the past few years, considerable efforts have been made to identify appropriate parametric models for single-cell expression data. The zero-inflated versions of Poisson/negative binomial and log-normal distributions have emerged as the most popular alternatives owing to their ability to accommodate high dropout rates, as commonly observed in single-cell data. Although the majority of the parametric approaches directly model expression estimates, we explore the potential of modeling expression ranks, as robust surrogates for transcript abundance. Here we examined the performance of the discrete generalized beta distribut...

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. There are various issues in single cell sequencing that affect homogeneous grouping (clustering) of cells, such as small amount of starting RNA, limited per-cell sequenced reads, cell-to-cell variability due to cell-cycle, cellular morphology, and variable reagent concentrations. Moreover, single cell data is susceptible to technical noise, which affects the quality of genes (or features) selected/extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single cell clustering), a stepwise robust unsupervised feature extraction and clustering approach that formulates and aggregates cell–cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach. sc-CGconv formulates a cell-cell graph using Ccor that is learned by a graph-based artificial intelligence model, graph convolution network. ...

2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), 2017
Occupants and their actions play major roles in building energy management as reported by previous studies, which involves finding the optimal schedule of user actions, under a given physical context, in order to minimize their dissatisfaction. However, comparison and performance analysis of various optimizers, for the concerned problem, have not been studied previously, which is essential to gain insight into the underlying characteristics of the problem. In this work, the performance of four popular and contemporary multi-objective optimization algorithms, viz. DEMO, NSGA-II, NSGA-III, and θ-DEA, for estimating the optimal schedule has been analyzed in terms of their abilities to find minimal average indoor conditions, to discover a larger number of alternative trade-off solutions (flexibility) and to promptly converge to a smaller minimal net dissatisfaction value (speed of convergence). Results show that NSGA-II has slightly better capabilities than NSGA-III and θ-DEA, and clearly outperforms DEMO. The recently developed population dynamics indicators are also applied to support the observed features of the optimizers. The proposed analyzing paradigm can also be used when the optimization problem is extended to include several other objectives.

2017 Computing Conference, 2017
In the field of building energy efficiency, researchers generally focus on building performance and how to enhance it. The objective of this work is to empower the building occupants by putting them in the loop of efficient energy use, supporting them to achieve their objectives by pointing out how far their actions are from an optimal set of actions. Different levels of explanation are investigated. First, indicators measuring the distance to optimality are proposed. An algorithm that generates deeper explanations is then presented to determine how changing some actions impacts comfort. The paper emphasizes the importance of explanations with a real case study. It identifies the type and level of explanations needed for different occupants. The concept of replay is presented: an occupant can replay their past actions and learn from them.

Background: Gene signature is useful to represent the molecular alterations in the disease genomes at specified conditions and is often used to distinguish samples into various groups for better research prospects as well as better clinical treatment. There is a lack of efficient techniques that can take into account the complex gene expression profile and identify the most relevant signatures. Methods: In this article, we presented a new framework to identify Dense Module based gene Signature (DeMoS) and their targeting miRNAs through a Quasi-Clique detection algorithm and their application in prognosis survival study. Here we applied a cervical cancer data repository with prognosis clinical data to conduct our experiment. We first performed an Empirical Bayes test using the Limma method to identify dysregulated genes (or miRNAs). MiRNA-mediated dysregulated target genes were extracted from those dysregulated miRNAs. Thereafter, we detected dense co-expressed modules using Quas...

With an increasing number of SARS-CoV-2 sequences available day by day, new genomic information is getting revealed to us. As SARS-CoV-2 sequences highlight wide changes across the samples, we aim to explore whether these changes reveal the geographical origin of the corresponding samples. The k-mer distributions, denoting normalized frequency counts of all possible combinations of nucleotides of size up to k, are often helpful to explore sequence level patterns. Given that the SARS-CoV-2 sequences are highly imbalanced by their geographical origin (with a relatively higher number of samples collected from the USA), we observe that with proper under-sampling, k-mer distributions in the SARS-CoV-2 sequences predict their geographical origin with more than 90% accuracy. The experiments are performed on the samples collected from six countries with the maximum number of sequences available till July 07, 2020. This comprises SARS-CoV-2 sequences from Australia, USA, China, India, Greece and France. Moreover,...
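
As a rough illustration of the k-mer distribution described in this abstract, the sketch below counts every nucleotide combination of size 1 up to k with a sliding window and normalizes the counts. Normalizing separately within each size, skipping windows that contain ambiguous bases, and the function name `kmer_distribution` are assumptions made here for illustration; the abstract does not specify these details.

```python
from collections import Counter
from itertools import product

def kmer_distribution(seq, k, alphabet="ACGT"):
    """Return normalized frequencies of all nucleotide combinations of size 1..k."""
    seq = seq.upper()
    features = {}
    for size in range(1, k + 1):
        # Count every window of the current size along the sequence.
        counts = Counter(seq[i:i + size] for i in range(len(seq) - size + 1))
        # Enumerate all possible combinations so absent k-mers get frequency 0.
        valid = ["".join(t) for t in product(alphabet, repeat=size)]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        total = sum(counts[m] for m in valid)
        for m in valid:
            features[m] = counts[m] / total if total else 0.0
    return features

# Hypothetical toy sequence, not real SARS-CoV-2 data.
dist = kmer_distribution("ACGTAC", 2)
print(dist["A"], dist["AC"])
```

The resulting fixed-length feature vector (4 + 16 + ... + 4^k entries) is the kind of input a classifier could use after under-sampling the over-represented classes.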

Abstract: Many single-cell typing methods require pure clustering of cells, which is susceptible to technical noise and heavily dependent on high quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are seemingly simple, which poses problems with respect to the resolution of (sub-)type detection and marker selection, and ultimately impacts cell annotation. We introduce sc-REnF, a novel and robust entropy based feature (gene) selection method, which leverages the landmark advantage of ‘Renyi’ and ‘Tsallis’ entropy achieved in their original application, in single cell clustering. Thereby, gene selection is robust and less sensitive to the technical noise present in the data, producing a pure clustering of cells, beyond classifying independent and unknown samples with utmost accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc-REnF

Gene selection in unannotated large single cell RNA sequencing (scRNA-seq) data is an important and crucial part of the preliminary steps of downstream analysis. The existing approaches, primarily based on high variation (highly variable genes) or significant high expression (highly expressed genes), fail to provide a stable and predictive feature set due to technical noise present in the data. Here, we propose RgCop, a novel regularized copula based method for gene selection from large single cell RNA-seq data. RgCop utilizes copula correlation (Ccor), a robust equitable dependence measure that captures multivariate dependency among a set of genes in single cell expression data. We raise an objective function by adding an l1 regularization term with Ccor to penalize the redundant coefficients of features/genes, resulting in a non-redundant effective features/genes set. Results show a significant improvement in the clustering/classification performance of real life scRNA-seq data over the other sta...

Bioinformatics, 2019
Summary: DropClust leverages Locality Sensitive Hashing (LSH) to speed up clustering of large scale single cell expression data. Here we present the improved dropClust, a complete R package that is fast, interoperable and minimally resource intensive. The new dropClust features a novel batch effect removal algorithm that allows integrative analysis of single cell RNA-seq (scRNA-seq) datasets. Availability and implementation: dropClust is freely available at https://github.com/debsin/dropClust as an R package. A lightweight online version of dropClust is available at https://debsinha.shinyapps.io/dropClust/. Supplementary information: Supplementary data are available at Bioinformatics online.

Summary: Single cell transcriptomics provides a window into cell-to-cell variability in complex tissues. Modeling single cell expression is challenging due to high noise levels and technical bias. In the past years, considerable efforts have been made to devise suitable parametric models for single cell expression data. We use Discrete Generalized Beta Distribution (DGBD) to model read counts corresponding to a gene as a function of rank. Use of DGBD yields better overall fit across genes compared to the widely used mixture model comprising Poisson and Negative Binomial density functions. Further, we use Wald’s test to probe into differential expression across cell sub-types. We package our implementation as a standalone software called ROSeq. When applied on real data-sets, ROSeq performed competitively compared to the state of the art methods including MAST, SCDE and ROTS. Software Availability: The Windows, macOS and Linux-compatible software is available for download at https://m...
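
To make the idea of modeling counts "as a function of rank" concrete, here is a minimal sketch of the rank-based form commonly cited for the DGBD, f(r) ∝ (N + 1 − r)^b / r^a for ranks r = 1..N. The abstract does not give the formula, so this parameterization, the parameter names `a` and `b`, and the function name are assumptions; the sketch does not reproduce ROSeq's fitting or its Wald test.

```python
def dgbd_pmf(n_ranks, a, b):
    """Normalized DGBD-style probabilities over ranks 1..n_ranks.

    Assumed form: f(r) = A * (n_ranks + 1 - r)**b / r**a, with A chosen
    so that the probabilities sum to 1.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    weights = [(n_ranks + 1 - r) ** b / r ** a for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameters: five ranks with a = 1.0 and b = 0.5.
for rank, prob in enumerate(dgbd_pmf(5, 1.0, 0.5), start=1):
    print(rank, round(prob, 4))
```

With positive `a` and `b` the probabilities fall off with rank, and the two exponents control the decay at the head and the tail of the rank ordering separately.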

BMC Genetics, 2018
Background: Study of epigenetics is currently a high-impact research topic. Multi stage methylation is also an area of high-dimensional prospect. In this article, we provide a new study (intra- and inter-species study) on brain tissue between human and rhesus on two methylation cytosine variants based data-profiles (viz., 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) samples) through TF-miRNA-gene network based module detection. Results: First of all, we determine differentially 5hmC methylated genes for human as well as rhesus for intra-species analysis, and differentially multi-stage methylated genes for inter-species analysis. Thereafter, we utilize weighted topological overlap matrix (TOM) measure and average linkage clustering consecutively on these genesets for intra- and inter-species study. We identify co-methylated and multi-stage co-methylated gene modules by using dynamic tree cut, for intra- and inter-species cases, respectively. Each module is represented by an individual color in the dendrogram. Gene Ontology and KEGG pathway based analyses are then performed to identify biological functionalities of the identified modules. Finally, the top ten regulator TFs and targeter miRNAs that are associated with the maximum number of gene modules are determined for both intra- and inter-species analysis. Conclusions: The novel TFs and miRNAs obtained from the analysis are: MYST3 and ZNF771 as TFs (for human intra-species analysis), BAZ2B, RCOR3 and ATF1 as TFs (for rhesus intra-species analysis), and mml-miR-768-3p and mml-miR-561 as miRs (for rhesus intra-species analysis); and MYST3 and ZNF771 as miRs (for inter-species study).
Furthermore, the genes/TFs/miRNAs that are already found to be liable for several serious brain-related diseases as well as rare neglected diseases (e.g., Wolf-Hirschhorn syndrome, Joubert syndrome, Huntington's disease, Simian Immunodeficiency Virus (SIV) mediated encephalitis, Parkinson's disease, bipolar disorder and schizophrenia) are mentioned.

IEEE Transactions on Knowledge and Data Engineering, 2015
... between communities are known to be denser than the non-overlapped regions of the communities. However, most of the existing algorithms that detect overlapping communities assume that the communities are denser than their surrounding regions, and falsely identify overlaps as communities. Further, many of these algorithms are computationally demanding and thus do not scale reasonably with varying network sizes. In this article, we propose FOCS (Fast Overlapped Community Search), an algorithm that accounts for local connectedness in order to identify overlapped communities. FOCS is shown to be linear in number of edges and nodes. It additionally gains in speed via simultaneous selection of multiple near-best communities rather than merely the best, at each iteration. FOCS outperforms some popular overlapped community finding algorithms in terms of ... Index Terms: overlapping community search, social network, local heuristic, complex network

Abstract: Differential coexpression has recently emerged as a new way to establish a fundamental difference in expression pattern among a group of genes between two populations. Earlier methods used some scoring techniques to detect changes in correlation patterns of a gene pair in two conditions. However, modeling differential coexpression by means of finding differences in the dependence structure of the gene pair has hitherto not been carried out. We exploit a copula-based framework to model differential coexpression between a gene pair in two different conditions. The copula is used to model the dependency between expression profiles of a gene pair. For a gene pair, the distance between two joint distributions produced by the copula serves as differential coexpression. We used five pan-cancer TCGA RNA-Seq datasets to evaluate the model, which outperforms the existing state-of-the-art. Moreover, the proposed model can detect a mild change in the coexpression pattern across two conditions. Fo...

Swarm and Evolutionary Computation, 2021
Multi-modal multi-objective optimization problems (MMMOPs) have multiple solution vectors mapping to the same objective vector. For MMMOPs, it is important to discover equivalent solutions associated with each point in the Pareto-Front for allowing end-users to make informed decisions. Prevalent multi-objective evolutionary algorithms are incapable of searching for multiple solution subsets, whereas algorithms designed for MMMOPs demonstrate degraded performance in the objective space. This motivates the design of better algorithms for addressing MMMOPs. The present work highlights the disadvantage of using crowding distance in the decision space when solving MMMOPs. Subsequently, an evolutionary framework, called graph Laplacian based Optimization using Reference vector assisted Decomposition (LORD), is proposed, which is the first algorithm to use decomposition in both objective and decision space for dealing with MMMOPs. Its filtering step is further extended to present the LORD-II algorithm, which demonstrates its dynamics on multi-modal many-objective problems. The efficacy of the frameworks is established by comparing their performance on 34 test instances (obtained from the CEC 2019 multi-modal multi-objective test suite) with the state-of-the-art algorithms for MMMOPs and other multi- and many-objective evolutionary algorithms. The manuscript concludes by mentioning the limitations of the proposed frameworks and future directions to design still better algorithms for MMMOPs.

Algorithms for Molecular Biology: AMB, 2015
Detecting protein complexes within protein-protein interaction (PPI) networks is a major step toward the analysis of biological processes and pathways. Identification and characterization of protein complexes in a PPI network is an ongoing challenge. Several high-throughput experimental techniques provide a substantial number of PPIs, which are widely utilized for compiling the PPI network of a species. Here we focus on detecting human protein complexes by developing a multiobjective framework. For this, the large human PPI network is partitioned into modules, which serve as protein complexes. For building the objective functions we have utilized topological properties of the PPI network and biological properties based on Gene Ontology semantic similarity. The proposed method is compared with some state-of-the-art algorithms in the context of different performance metrics. For the purpose of biological validation of our predicted complexes we have also employed a Gene Ontology and pathway b...

Multiobjective Genetic Algorithms for Clustering, 2011
In the past few decades major advances in the fields of molecular biology and genomic technology have led to an explosive growth in the biological information generated by the scientific community. Bioinformatics has evolved as an emerging research direction in response to this deluge of information. It is viewed as the use of computational methods to make biological discoveries, and is almost synonymous with computational biology.