Papers by Avanti Shrikumar
Deep neural networks and support vector machines have been shown to accurately predict genome-wide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and Gkm-Explain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments a...
The intrinsic DNA sequence preferences and cell-type specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell-type specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species-specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repea...
Deep learning models can accurately map genomic DNA sequences to associated functional molecular readouts such as protein–DNA binding data. Base-resolution importance (i.e. “attribution”) scores inferred from these models can highlight predictive sequence motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors trained on genome-wide binary or continuous molecular profiles. We show that our attribution prior dramatically improves models’ stability, interpretability, and performance on held-out data, especially when training data is...
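The idea of penalizing high-frequency components of an attribution track can be illustrated with a small sketch. This is not the authors' implementation: the function names, the hard `cutoff` parameter, and the choice to report the high-frequency share of spectral magnitude are all assumptions made for illustration, and the scores here are fixed lists rather than attributions computed from a model during training.

```python
import cmath
import math


def dft_magnitudes(scores):
    """Magnitudes of the discrete Fourier transform of a 1-D list of scores."""
    n = len(scores)
    mags = []
    for k in range(n):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(scores)
        )
        mags.append(abs(coeff))
    return mags


def high_frequency_penalty(scores, cutoff):
    """Share of non-DC spectral magnitude that lies above frequency index `cutoff`.

    Only the positive-frequency half (k = 1 .. n // 2) is used, because the
    spectrum of a real-valued signal is symmetric. `cutoff` is a hypothetical
    hyperparameter introduced for this sketch.
    """
    mags = dft_magnitudes(scores)
    half = mags[1 : len(scores) // 2 + 1]
    total = sum(half)
    if total == 0.0:
        return 0.0
    high = sum(m for k, m in enumerate(half, start=1) if k > cutoff)
    return high / total


# A smooth track (one slow cycle) versus a track that flips sign at every base.
smooth = [math.sin(2 * math.pi * t / 16) for t in range(16)]
noisy = [(-1.0) ** t for t in range(16)]

print(high_frequency_penalty(smooth, cutoff=3))  # close to 0
print(high_frequency_penalty(noisy, cutoff=3))   # close to 1
```

In a training setup, a term of this kind would be added to the loss so that models producing smooth, motif-like attribution tracks are preferred over models producing base-to-base noise.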
The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present SNPpet, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained SNPpet on the Sharpr-MPRA dataset that measures the activity of ~500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. SNPpet's predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) bindi...
Advanced machine learning models applied to large-scale genomics datasets hold the promise to be major drivers for genome science. Once trained, such models can serve as a tool to probe the relationships between data modalities, including the effect of genetic variants on phenotype. However, lack of standardization and limited accessibility of trained models have hampered their impact in practice. To address this, we present Kipoi, a collaborative initiative to define standards and to foster reuse of trained models in genomics. Already, the Kipoi repository contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard grants automated software installation and provides unified interfaces to apply and interpret models. We illustrate Kipoi through canonical use cases, including model benchmarking, transfer learning, variant effect prediction, and building new models from existing ones. By ...
Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns i...
Deep learning, which describes a class of machine learning algorithms, has recently shown impressive results across a variety of domains. Biology and medicine are data rich, but the data are complex and often ill-understood. Problems of this nature may be particularly well-suited to deep learning techniques. We examine applications of deep learning to a variety of biomedical problems -- patient classification, fundamental biological processes, and treatment of patients -- to predict whether deep learning will transform these tasks or if the biomedical sphere poses unique challenges. We find that deep learning has yet to revolutionize or definitively resolve any of these problems, but promising advances have been made on the prior state of the art. Even when improvement over a previous baseline has been modest, we have seen signs that deep learning methods may speed or aid human investigation. More work is needed to address concerns related to interpretability and how to best model ...
Deep learning approaches that have produced breakthrough predictive models in computer vision, speech recognition and machine translation are now being successfully applied to problems in regulatory genomics. However, deep learning architectures used thus far in genomics are often directly ported from computer vision and natural language processing applications with few, if any, domain-specific modifications. In double-stranded DNA, the same pattern may appear identically on one strand and its reverse complement due to complementary base pairing. Here, we show that conventional deep learning models that do not explicitly model this property can produce substantially different predictions on forward and reverse-complement versions of the same DNA sequence. We present four new convolutional neural network layers that leverage the reverse-complement property of genomic DNA sequence by sharing parameters between forward and reverse-complement representations in the model. These layers g...
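The reverse-complement property described above can be made concrete with a toy example. The sketch below is only an analogy, not one of the layers from the paper: an exact-match motif counter stands in for a convolutional filter, and all function names and the example sequence are hypothetical. Scanning with a single pattern gives different answers on the two strands, while scanning with the pattern and its reverse complement together (the counterpart of sharing parameters between forward and reverse-complement filters) gives the same answer on both.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over the alphabet ACGT."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def scan_single_strand(seq, motif):
    """Count exact occurrences of `motif` in `seq` (toy stand-in for one filter)."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i : i + width] == motif)


def scan_both_strands(seq, motif):
    """Count matches to the motif and to its reverse complement.

    Pairing each pattern with its reverse complement makes the total
    identical for a sequence and its reverse complement.
    """
    return scan_single_strand(seq, motif) + scan_single_strand(
        seq, reverse_complement(motif)
    )


seq = "TTGATAAGCC"
rc_seq = reverse_complement(seq)  # "GGCTTATCAA"
motif = "GATAA"

print(scan_single_strand(seq, motif), scan_single_strand(rc_seq, motif))  # 1 0
print(scan_both_strands(seq, motif), scan_both_strands(rc_seq, motif))    # 1 1
```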
Cell, 2012
Heart development is exquisitely sensitive to the precise temporal regulation of thousands of genes that govern developmental decisions during differentiation. However, we currently lack a detailed understanding of how chromatin and gene expression patterns are coordinated during developmental transitions in the cardiac lineage. Here, we interrogated the transcriptome and several histone modifications across the genome during defined stages of cardiac differentiation. We find distinct chromatin patterns that are coordinated with stage-specific expression of functionally related genes, including many human disease-associated genes. Moreover, we discover a novel preactivation chromatin pattern at the promoters of genes associated with heart development and cardiac function. We further identify stage-specific distal enhancer elements and find enriched DNA binding motifs within these regions that predict sets of transcription factors that orchestrate cardiac differentiation. Together, these findings form a basis for understanding developmentally regulated chromatin transitions during lineage commitment and the molecular etiology of congenital heart disease.
Predictive models that map double-stranded regulatory DNA to molecular signals of regulatory activity should, in principle, produce identical predictions regardless of whether the sequence of the forward strand or its reverse complement (RC) is supplied as input. Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with RC sequences. Two strategies have emerged in the literature to enforce this symmetry: conjoined (a.k.a. “siamese”) architectures, where the model is run in parallel on both strands and the predictions are combined, and RC parameter sharing (RCPS), where weight sharing ensures that the response of the model is equivariant across strands. However, systematic benchmarks are lacking, and neither architecture has been adapted to base-resolution signal profile prediction tasks. In this work, we extend conjoined and RCPS models to signal profile prediction, and introduce a ...
Genes are regulated through enhancer sequences, in which transcription factor binding motifs and their specific arrangements (syntax) form a cis-regulatory code. To understand the relationship between motif syntax and transcription factor binding, we train a deep learning model that uses DNA sequence to predict base-resolution binding profiles of four pluripotency transcription factors Oct4, Sox2, Nanog, and Klf4. We interpret the model to accurately map hundreds of thousands of motifs in the genome, learn novel motif representations and identify rules by which motifs and syntax influence transcription factor binding. We find that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences motif interactions at protein and nucleosome range. Most strikingly, Nanog binding is driven by motifs with a strong preference for ~10.5 bp spacings corresponding to helical periodicity. Interpreting deep learning models applied to high-resolution binding data is a powerful and versatile approach to uncover the motifs and syntax of cis-regulatory sequences.
Nature Biotechnology, May 28, 2019
We argue here that it is a quirk of history that both MRT and gene editing have come to the forefront of public attention at roughly the same time. The early start on MRT in the United Kingdom enabled that country to successfully develop quite different regulatory policy approaches to the two technologies 5; in contrast, the fear of germline gene editing in the United States and Canada has frozen the policy conversation on MRT. We should not let fear drive use of a sledgehammer for regulation when a scalpel will better enable us to divide the good from the bad. Although realistic about the barriers to change, we have outlined possible ways forward for both the United States and Canada that would enable progress on MRT, or possibly some limited germline gene editing, without opening the floodgate. We argue that this path, and not outright prohibition, is the best way forward because citizens deserve to benefit from the advancement of science and its applications. Moreover, in our globalized world, national prohibitions cannot fully achieve their goals. As the travel of patients to Mexico for MRT performed by US doctors demonstrates (as do other examples) 27,28, patients who desperately wish to access certain interventions will travel abroad to get them. Unless countries such as the United States and Canada are willing to limit the entry of children born through these technologies (were it even possible, and we are skeptical) and extend their criminal jurisdiction extraterritorially to prevent the use of these technologies, the reality is that some citizens of each country will bring germline alterations back into the country. Our view is that, to best protect citizens from harm, limited regulatory pathways that can be monitored and carefully delineated are preferable to shadowy practices and a potential regulatory race to the bottom.
Convolutional neural networks are rapidly gaining popularity in regulatory genomics. Typically, these networks have a stack of convolutional and pooling layers, followed by one or more fully connected layers. In genomics, the same positional patterns are often present across multiple convolutional channels. Therefore, in current state-of-the-art networks, there exists significant redundancy in the representations learned by standard fully connected layers. We present a new separable fully connected layer that learns a weights tensor that is the outer product of positional weights and cross-channel weights, thereby allowing the same positional patterns to be applied across multiple convolutional channels. Decomposing positional and cross-channel weights further enables us to readily impose biologically-inspired constraints on positional weights, such as symmetry. We also propose a novel regularizer and constraint that act on curvature in the positional weights. Using experiments on s...
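The outer-product factorization described above can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not the paper's layer: it covers a single output unit with no bias or nonlinearity, and the function names and example values are hypothetical. It shows that a dense weight matrix built as the outer product of positional and cross-channel weights gives the same output as applying the two weight vectors separately, using fewer parameters.

```python
import math


def dense_output(x, weights):
    """Standard fully connected unit: one weight per (position, channel) pair."""
    return sum(
        x[p][c] * weights[p][c] for p in range(len(x)) for c in range(len(x[0]))
    )


def outer_product(pos_w, chan_w):
    """Weight matrix W[p][c] = pos_w[p] * chan_w[c]."""
    return [[pw * cw for cw in chan_w] for pw in pos_w]


def separable_output(x, pos_w, chan_w):
    """Mix channels at each position, then weight the positions."""
    return sum(
        pw * sum(cw * xc for cw, xc in zip(chan_w, row))
        for pw, row in zip(pos_w, x)
    )


# Toy activations: 4 positions x 3 convolutional channels.
x = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
pos_w = [0.5, 1.0, 1.0, 0.5]   # symmetric positional weights
chan_w = [1.0, -2.0, 0.5]      # cross-channel weights

dense = dense_output(x, outer_product(pos_w, chan_w))
separable = separable_output(x, pos_w, chan_w)
print(math.isclose(dense, separable))

# Parameter counts for one output unit: positions * channels versus positions + channels.
print(len(pos_w) * len(chan_w), len(pos_w) + len(chan_w))  # 12 7
```

Because the positional weights live in their own vector, a constraint such as symmetry (as in `pos_w` here) can be imposed on them directly.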
Circulation Research, 2014
Rationale: Neonatal mice have the capacity to regenerate their hearts in response to injury, but this potential is lost after the first week of life. The transcriptional changes that underpin mammalian cardiac regeneration have not been fully characterized at the molecular level. Objective: The objectives of our study were to determine whether myocytes revert the transcriptional phenotype to a less differentiated state during regeneration and to systematically interrogate the transcriptional data to identify and validate potential regulators of this process. Methods and Results: We derived a core transcriptional signature of injury-induced cardiac myocyte (CM) regeneration in mouse by comparing global transcriptional programs in a dynamic model of in vitro and in vivo CM differentiation, an in vitro CM explant model, as well as a neonatal heart resection model. The regenerating mouse heart revealed a transcriptional reversion of CM differentiation processes, including reactivation of l...