The Virus Pathogen Resource (ViPR; www.viprbrc.org) and Influenza Research Database (IRD; www.fludb.org) have developed a metadata-driven Comparative Analysis Tool for Sequences (meta-CATS), which performs statistical comparative analyses of nucleotide and amino acid sequence data to identify correlations between sequence variations and virus attributes (metadata). Meta-CATS guides users through: selecting a set of nucleotide or protein sequences; dividing them into multiple groups based on any associated metadata attribute (e.g. isolation location, host species); performing a statistical test at each aligned position; and identifying all residues that significantly differ between the groups. As proofs of concept, we have used meta-CATS to identify sequence biomarkers associated with dengue viruses isolated from different hemispheres, and to identify variations in the NS1 protein that are unique to each of the 4 dengue serotypes. Meta-CATS is made freely available to virology researchers to identify genotype-phenotype correlations for development of improved vaccines, diagnostics, and therapeutics.
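As a rough illustration of the per-position comparison described in this abstract, the sketch below computes a Pearson chi-square statistic for every variable column of an alignment split into groups. The choice of a chi-square test, the function name chi_square_by_position, and the toy sequences are assumptions made for illustration; the abstract does not say which statistical test meta-CATS applies.

```python
from collections import Counter


def chi_square_by_position(groups):
    """Return (position, chi2, dof) for each variable alignment column.

    groups: dict mapping a metadata label (e.g. hemisphere) to a non-empty
    list of aligned sequences, all of the same length. Hypothetical input
    format, not the meta-CATS interface.
    """
    labels = list(groups)
    length = len(groups[labels[0]][0])
    total = sum(len(groups[g]) for g in labels)
    results = []
    for pos in range(length):
        # Residue counts per group at this column.
        counts = {g: Counter(seq[pos] for seq in groups[g]) for g in labels}
        residues = sorted(set().union(*counts.values()))
        if len(residues) < 2:
            continue  # fully conserved column, nothing to test
        chi2 = 0.0
        for g in labels:
            row_total = len(groups[g])
            for r in residues:
                col_total = sum(counts[h][r] for h in labels)
                expected = row_total * col_total / total
                chi2 += (counts[g][r] - expected) ** 2 / expected
        dof = (len(labels) - 1) * (len(residues) - 1)
        results.append((pos, chi2, dof))
    return results


if __name__ == "__main__":
    toy = {"north": ["ACGT", "ACGT"], "south": ["ACTT", "ACTT"]}
    print(chi_square_by_position(toy))
```

Turning each statistic into a p-value needs the chi-square distribution with the reported degrees of freedom, which the standard library does not provide; with so few sequences per cell an exact test would in any case be the safer choice.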
In this paper, an analysis is presented of the changing baby namespace and a model is created for predicting whether a name's popularity is trending up or down. Just as cultures and societies change over time, baby names evolve to reflect these changes. By analyzing name phonemes and historical influences, one can better understand the underlying causes of the changing name trend. Utilizing the U.S. Social Security Administration (SSA) name registry and historical figure data sets, the influence of historical figures and name pronunciation on the naming trend was examined. Two neural networks were created to predict name trend, one utilizing name count and the other utilizing name pronunciation. Phoneme embeddings were also created to cluster and visualize similar and dissimilar sounding names. The analysis concluded that while historical factors do influence the U.S. naming trend, these factors are too inconsistent and sporadic to include in a name forecasting model. The phoneme-driven model classified name trend with 72% accuracy, while the model using name counts achieved 92% accuracy. Based on these results, there is a relationship between similar sounding names and their popularity trends, but it is not as predictive as purely using name count.
We consider a framework for determining and estimating the conditional pairwise relationships of variables when the observed samples are contaminated with measurement error in high-dimensional settings. Assuming the true underlying variables follow a multivariate Gaussian distribution, if no measurement error is present, this problem is often solved by estimating the precision matrix under sparsity constraints. However, when measurement error is present, not correcting for it leads to inconsistent estimates of the precision matrix and poor identification of relationships. We propose a new Bayesian methodology to correct for the measurement error from the observed samples. This Bayesian procedure utilizes a recent variant of the spike-and-slab Lasso to obtain a point estimate of the precision matrix, and corrects for the contamination via the recently proposed Imputation-Regularization Optimization procedure designed for missing data. Our method is shown to perform better than the naive method that ignores measurement error in both identification and estimation accuracy. To show the utility of the method, we apply the new method to establish a conditional gene network from a microarray dataset.
Proceedings of the National Academy of Sciences of the United States of America, Apr 29, 1997
Cellular and humoral immunity have been implicated in the pathogenesis of atherosclerosis. To determine whether an intact immune system is necessary for the formation of atherosclerotic lesions, we have generated immunodeficient mice with hypercholesterolemia and atherosclerosis by crossbreeding the apolipoprotein E (apoE)-deficient mouse with the recombinase activating gene 1 (Rag-1) knockout mouse. Chow-fed immunodeficient mice with targeted disruption in both apoE and Rag-1 (E0/R0) had a 2-fold decrement in aortic root lesion size at 16 weeks of age, compared with immunocompetent littermates, which were heterozygotes at the Rag-1 locus (E0/R1). Nearly all atherosclerotic lesions from chow-fed animals were limited to raised foam cell fatty streaks. In contrast, when a second group of animals was fed a high-fat Western-type diet to accelerate lesion development, there were no differences in either aortic root lesion size or the percent of the total aorta occupied by lesions. Fibrous plaques with well-defined caps and necrotic cores were detected in both Western diet-fed E0/R0 and E0/R1 animals. We conclude that T and B lymphocytes play only a minor role in the rate of forming foam cell lesions, and they are not necessary for the formation of fibroproliferative plaques.
We consider high-dimensional generalized linear models when the covariates are contaminated by measurement error. Estimates from errors-in-variables regression models are well known to be biased in traditional low-dimensional settings if the error is unincorporated. Such models have recently become of interest when regularizing penalties are added to the estimation procedure. Unfortunately, correcting for the mismeasurements can add undue computational difficulties onto the optimization, which calls for a new tool set that lets practitioners use the models successfully. We investigate a general procedure that utilizes the recently proposed Imputation-Regularized Optimization algorithm for high-dimensional errors-in-variables models, which we implement for continuous, binary, and count response types. Crucially, our method allows for off-the-shelf linear regression methods to be employed in the presence of contaminated covariates. We apply our correction to gene microarray data, and illustrate that it results in a great reduction in the number of false positives whilst still retaining most true positives.
Chapter 2 Extrapolative and Decomposition Models: 2.1. Introduction; 2.2. Goodness-of-Fit Indicators; 2.3. Averaging Techniques; 2.3.1. The Simple Average; 2.3.2. The Single Moving Average; 2.3.3. Centered Moving Averages; 2.3.4. Double Moving Averages; 2.3.5. Weighted Moving Averages; 2.6. New Features of Census X-12; References; References. Chapter 4 The Basic ARIMA Model: 4.1. Introduction to ARIMA; 4.2. Graphical Analysis of Time Series Data; 4.2.1. Time Sequence Graphs; 4.2.2. Correlograms and Stationarity; 4.3. Basic Formulation of the Autoregressive Integrated Moving Average Model; 4.4. The Sample Autocorrelation Function; 4.5. The Standard Error of the ACF; 4.6. The Bounds of Stationarity and Invertibility; 4.7. The Sample Partial Autocorrelation Function; 4.7.1. Standard Error of the PACF; 4.8. Bounds of Stationarity and Invertibility Reviewed; 4.9. Other Sample Autocorrelation Functions; 4.10. Tentative Identification of Characteristic Patterns of Integrated, Autoregressive, Moving Average, and ARMA Processes
In this work, we investigate using Fourier coefficients (FCs) for capturing useful information about viral sequences in a computationally efficient and compact manner. Specifically, we extract geographic submission location from SARS-CoV-2 sequence headers submitted to the GISAID Initiative, calculate corresponding FCs, and use the FCs to classify these sequences according to geographic location. We show that the FCs serve as useful numerical summaries for sequences that allow manipulation, identification, and differentiation via classical mathematical and statistical methods that are not readily applicable to character strings. Further, we argue that subsets of the FCs may be usable for the same purposes, which results in a reduction in storage requirements. We conclude by offering extensions of the research and potential future directions for subsequent analyses, such as the use of other series transforms for discretely indexed signals such as genomes.
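To make the idea of Fourier coefficients of a character sequence concrete, here is a minimal sketch. It assumes one common numerical encoding, a 0/1 indicator signal per nucleotide, and a plain discrete Fourier transform; the abstract does not state which encoding or transform implementation the authors used, and the function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def fourier_coefficients(seq, alphabet="ACGT"):
    """Map each base to the DFT of its 0/1 indicator signal (assumed encoding)."""
    return {
        base: dft([1.0 if ch == base else 0.0 for ch in seq])
        for base in alphabet
    }


def power_spectrum(seq, alphabet="ACGT"):
    """Sum of squared coefficient magnitudes over the bases, per frequency."""
    coeffs = fourier_coefficients(seq, alphabet)
    return [sum(abs(coeffs[b][k]) ** 2 for b in alphabet) for k in range(len(seq))]


if __name__ == "__main__":
    print(power_spectrum("ACGTACGTAC"))
```

The resulting lists of numbers can be compared with ordinary distance measures or passed to a classifier, which is the kind of use the abstract describes. For genome-length sequences a fast Fourier transform would replace the naive loop.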
2021 IEEE Biomedical Circuits and Systems Conference (BioCAS), Oct 7, 2021
The comparison of genomic sequences is an important undertaking, for example in phylogenetic and differential sequence analyses. In this work we describe three filter designs that can be applied to genomic power spectra (PS) of any length to reduce their size while maintaining the relative distances which they provide and which are relevant for data reduction, sorting, and correlation studies of an ensemble of sequences. Specifically, we present: Minimal Variance Filtering (MVF), where the subsets of coefficients with the highest variance across a sample are selected; Automated Filter Learning (AFL), where a set of linear combinational filters is learned automatically by a 1-D deep convolutional neural network attempting to classify sequences on region of origin; and Maximal Variance Principal Components Filters (MVPCF), which provide a set of filters in the principal component loadings determined among the highest variance elements of the PS for a sample. We provide a comparison of these approaches by examining their conservation of distances produced by the entire PS, and conclude with remarks about the benefits and drawbacks of each method while providing future avenues of pursuit for this research.
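The variance-based selection step mentioned in this abstract can be sketched as follows: given several power spectra of equal length, keep only the k coefficient positions whose values vary most across the sample. The function names, the use of population variance, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from statistics import pvariance


def top_variance_indices(spectra, k):
    """Indices of the k coefficients with the highest variance across spectra.

    spectra: list of equal-length lists of numbers, one per sequence.
    """
    n = len(spectra[0])
    variances = [pvariance([s[i] for s in spectra]) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])  # return in original coefficient order


def apply_filter(spectrum, indices):
    """Reduce one spectrum to the selected coefficients."""
    return [spectrum[i] for i in indices]


if __name__ == "__main__":
    sample = [[1, 5, 2, 0], [1, 9, 4, 0], [1, 1, 6, 0]]
    keep = top_variance_indices(sample, 2)
    print(keep, [apply_filter(s, keep) for s in sample])
```

One way to check such a filter, in the spirit of the comparison the abstract describes, is to compute pairwise distances between sequences before and after filtering and see how well their ordering is preserved.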
This paper introduces flowthrough centrality, a node centrality measure determined from the hierarchical maximum concurrent flow problem (HMCFP). Based upon the extent to which a node is acting as a hub within a network, this centrality measure is defined to be the fraction of the flow passing through the node to the total flow capacity of the node. Flowthrough centrality is compared to the commonly-used centralities of closeness centrality, betweenness centrality, and flow betweenness centrality, as well as to stable betweenness centrality to measure the stability (i.e., accuracy) of the centralities when knowledge of the network topology is incomplete or in transition. Perturbations do not alter the flowthrough centrality values of nodes that are based upon flow as much as they do other types of centrality values that are based upon geodesics. The flowthrough centrality measure overcomes the problem of overstating or understating the roles that significant actors play in social networks. The flowthrough centrality is canonical in that it is determined from a natural, realized flow universally applicable to all networks.
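The ratio in this definition can be illustrated with a small sketch. It assumes the edge flows have already been obtained (for example from an HMCFP solution, which is not computed here) and it approximates "flow passing through the node" by the total flow on the node's incident edges, without separating flow that starts or ends at the node. That simplification, the input format, and the function name are assumptions for illustration only and may differ from the paper's exact definition.

```python
def flowthrough_centrality(edges):
    """Ratio of incident flow to incident capacity for every node.

    edges: iterable of (u, v, flow, capacity) tuples for an undirected
    network; the flow values are taken as given (hypothetical input).
    """
    flow, capacity = {}, {}
    for u, v, f, c in edges:
        for node in (u, v):
            flow[node] = flow.get(node, 0.0) + f
            capacity[node] = capacity.get(node, 0.0) + c
    # Nodes with zero incident capacity are skipped to avoid dividing by zero.
    return {n: flow[n] / capacity[n] for n in flow if capacity[n] > 0}


if __name__ == "__main__":
    toy_edges = [("a", "b", 2, 4), ("b", "c", 1, 4)]
    print(flowthrough_centrality(toy_edges))
```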
Gene expression microarrays are widely used to study genome-wide gene expression profiles. Many microarray analysis methods are available, making it challenging to decide which method to use. While the effectiveness of some of these methods has been assessed using artificial spike-in data sets, analytical approaches that work well with spike-in data may not work as well with data from real biological samples. To evaluate these methods we applied Gene Ontology (GO) term co-clustering as a comparative tool to evaluate 300 different data processing pipelines composed of various background correction, normalization and summarization methods using real biological data, based on the premise that an improvement in any step of microarray data analysis should be reflected in improved co-clustering of related genes. Our results suggest that background correction has little effect on GO term co-clustering characteristics, normalization has a big impact, some summarization methods constantly ou...
Sports such as diving, gymnastics, and ice skating rely on expert judges to score performance accurately. Human error and bias can affect the scores, sometimes leading to controversy, especially at high levels. Instant replay or recorded video can be used to assess judges’ scores, or sometimes update judges’ scores, during a competition. For diving in particular, judges are trained to look for certain characteristics of a dive, such as angle of entry, height of splash, and distance of the dive from the end of the board, to score each dive on a scale of 0 to 10, where a 0 is a failed dive and a 10 is a perfect dive. In an effort to obtain objective comparisons for judges’ scores, a diving meet was filmed and the video footage used to measure certain characteristics of each dive for each participant. The variables measured from the video were height of the dive at its apex, angle of entry into the water, and distance of the dive from the end of the board. The measured items were then ...
2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
Forecasting from multivariate time series data is a difficult task, made more so in the situation where the number of series (p) is much larger than the length of each series (T), which makes dimension reduction desirable prior to obtaining a model. The LASSO has become a widely-used method to choose relevant covariates out of many candidates, and it has many variations and extensions, such as grouped LASSO, adaptive LASSO, weighted lag adaptive LASSO, and fused LASSO. Of these, only the weighted lag adaptive LASSO and the fused LASSO take into account natural ordering among series. To examine the ability of variations on the LASSO to choose relevant covariates for short time series we use simulations for series with fewer than 50 observations. We then apply the methods to a data set on significant changes in self-reported psycho-social symptoms in the 30 years after the Chornobyl nuclear catastrophe.
Results • There was little evidence of statistical change in this feasibility study; yet, valuable lessons were learned. Paired t-tests indicate there was a significant difference in the total power scores in the imagery group, and in the expected direction (two-tailed, t-statistic = -2.3, P = 0.035) and the choices sub-scale (two-tailed, t-statistic = -2.93, P = 0.01) of the power instrument from weeks one to 16 of the study. Eight of 17 (47%) participants in the MI group reduced or discontinued their medications. Three of 16 (19%) particip... Author affiliations: ... the School of Nursing, Hunter College, City University of New York and health patterning therapist, New York, NY. Carole Birdsall, RN, EdD, is professor at the School of Nursing, Hunter College, City University of New York. Monnie McGee, PhD, is assistant professor, Dept. of Mathematics and Statistics, Southern Methodist University. Kim P. Baron, PhD, is a clinical psychologist. Stephen Lowenstein, MS, RRT, is director, Pulmonary Laboratory and ...
Motivation: Affymetrix GeneChip arrays require summarization in order to combine the probe-level intensities into one value representing the expression level of a gene. However, probe intensity measurements are expected to be affected by different levels of non-specific- and cross-hybridization to non-specific transcripts. Here, we present a new summarization technique, the Distribution Free Weighted method (DFW), which uses information about the variability in probe behavior to estimate the extent of non-specific and cross-hybridization for each probe. The contribution of the probe is weighted accordingly during summarization, without making any distributional assumptions for the probe-level data. Results: We compare DFW with several popular summarization methods on spike-in datasets, via both our own calculations and the ‘Affycomp II’ competition. The results show that DFW outperforms other methods when sensitivity and specificity are considered simultaneously. With the Affycomp spike-...
In this paper, we utilize next-generation sequencing (NGS) data from the LungMap project [1] to identify and characterize the developmental RNA transcriptome in alveolar epithelial type II (AT2) cells of embryonic mouse lungs at gestational ages embryonic day 16 (E16) and 18 (E18). Late gestation lung cellular maturation is necessary for survival at birth [4]. Using R and the BioConductor packages for RNAseq analysis, we analyze changes in the mouse lung AT2 cell RNA transcriptome as this maturation process takes place. In particular, we identify the cluster of genes whose expression changes markedly between immature (E16) and mature (E18) lungs, which can be used to define cell pathways that appear critical for the maturation process. Our results show that there are 98 differentially expressed genes, with 82 genes where differences in counts cannot be attributed to differences in sample origin. We were surprised to identify substantial differences in RNA expression between two experien...
The Virus Pathogen Resource (ViPR; www.viprbrc.org) and Influenza Research Database (IRD; www. fl... more The Virus Pathogen Resource (ViPR; www.viprbrc.org) and Influenza Research Database (IRD; www. fludb.org) have developed a metadata-driven Comparative Analysis Tool for Sequences (meta-CATS), which performs statistical comparative analyses of nucleotide and amino acid sequence data to identify correlations between sequence variations and virus attributes (metadata). Meta-CATS guides users through: selecting a set of nucleotide or protein sequences; dividing them into multiple groups based on any associated metadata attribute (e.g. isolation location, host species); performing a statistical test at each aligned position; and identifying all residues that significantly differ between the groups. As proofs of concept, we have used meta-CATS to identify sequence biomarkers associated with dengue viruses isolated from different hemispheres, and to identify variations in the NS1 protein that are unique to each of the 4 dengue serotypes. Meta-CATS is made freely available to virology researchers to identify genotype-phenotype correlations for development of improved vaccines, diagnostics, and therapeutics.
In this paper, an analysis is presented of the changing baby namespace and a model is created for... more In this paper, an analysis is presented of the changing baby namespace and a model is created for predicting if a name's popularity is trending up or down. Just as cultures and societies change over time, baby names evolve to reflect these changes. By analyzing name phonemes and historical influences, one can better understand the underlying causes of the changing name trend. Utilizing the U.S. Social Security Administration (SSA) name registry and historical figure data sets, the influence of historical figures and name pronunciation on the naming trend was examined. Two neural networks were created to predict name trend, one utilizing name count and the other utilizing name pronunciation. Phoneme embeddings were also created to cluster and visualize similar and dissimilar sounding names. The analysis concluded that while historical factors do influence the U.S. naming trend, these factors are too inconsistent and sporadic to include in a name forecasting model. The phoneme-driven model classified name trend with 72% percent accuracy, while the model using name counts achieved 92% accuracy. Based on these results, there is a relationship between similar sounding names and their popularity trends, but it is not as predictive as purely using name count. 1 More information about Romeo and Juliet may be found at https://www.shakespeare.org.uk/explore-shakespeare/shakespedia/shakespearesplays/romeo-and-juliet/. Last accessed
We consider a framework for determining and estimating the conditional pairwise relationships of ... more We consider a framework for determining and estimating the conditional pairwise relationships of variables when the observed samples are contaminated with measurement error in high dimensional settings. Assuming the true underlying variables follow a multivariate Gaussian distribution, if no measurement error is present, this problem is often solved by estimating the precision matrix under sparsity constraints. However, when measurement error is present, not correcting for it leads to inconsistent estimates of the precision matrix and poor identification of relationships. We propose a new Bayesian methodology to correct for the measurement error from the observed samples. This Bayesian procedure utilizes a recent variant of the spike-and-slab Lasso to obtain a point estimate of the precision matrix, and corrects for the contamination via the recently proposed Imputation-Regularization Optimization procedure designed for missing data. Our method is shown to perform better than the naive method that ignores measurement error in both identification and estimation accuracy. To show the utility of the method, we apply the new method to establish a conditional gene network from a microarray dataset.
Proceedings of the National Academy of Sciences of the United States of America, Apr 29, 1997
Cellular and humoral immunity have been implicated in the pathogenesis of atherosclerosis. To det... more Cellular and humoral immunity have been implicated in the pathogenesis of atherosclerosis. To determine whether an intact immune system is necessary for the formation of atherosclerotic lesions, we have generated immunodeficient mice with hypercholesterolemia and atherosclerosis by crossbreeding the apolipoprotein E (apoE)deficient mouse with the recombinase activating gene 1 (Rag-1) knockout mouse. Chow-fed immunodeficient mice with targeted disruption in both apoE and Rag-1 (E0͞R0) had a 2-fold decrement in aortic root lesion size at 16 weeks of age, compared with immunocompetent littermates, which were heterozygotes at the Rag-1 locus (E0͞R1). Nearly all atherosclerotic lesions from chow-fed animals were limited to raised foam cell fatty streaks. In contrast, when a second group of animals was fed a high-fat Western-type diet to accelerate lesion development, there were no differences in either aortic root lesion size or the percent of the total aorta occupied by lesions. Fibrous plaques with well-defined caps and necrotic cores were detected in both Western diet-fed E0͞R0 and E0͞R1 animals. We conclude that T and B lymphocytes play only a minor role in the rate of forming foam cell lesions, and they are not necessary for the formation of fibroproliferative plaques. The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked ''advertisement'' in accordance with 18 U.S.C. §1734 solely to indicate this fact.
We consider high-dimensional generalized linear models when the covariates are contaminated by me... more We consider high-dimensional generalized linear models when the covariates are contaminated by measurement error. Estimates from errors-in-variables regression models are well-known to be biased in traditional low-dimensional settings if the error is unincorporated. Such models have recently become of interest when regularizing penalties are added to the estimation procedure. Unfortunately, correcting for the mismeasurements can add undue computational difficulties onto the optimization, which a new tool set for practitioners to successfully use the models. We investigate a general procedure that utilizes the recently proposed Imputation-Regularized Optimization algorithm for high-dimensional errors-in-variables models, which we implement for continuous, binary, and count response type. Crucially, our method allows for off-the-shelf linear regression methods to be employed in the presence of contaminated covariates. We apply our correction to gene microarray data, and 1 arXiv:1912.11740v2 [stat.CO] 2 Jan 2020 illustrate that it results in a great reduction in the number of false positives whilst still retaining most true positives.
Chapter 2 Extrapolative and Decomposition Models 2.1. Introduction 15 vii viii Contents 2.2. Good... more Chapter 2 Extrapolative and Decomposition Models 2.1. Introduction 15 vii viii Contents 2.2. Goodness-of-Fit Indicators 15 2.3. Averaging Techniques 18 2.3.1. The Simple Average 18 2.3.2. The Single Moving Average 18 2.3.3. Centered Moving Averages 20 2.3.4. Double Moving Averages 20 2.3.5. Weighted Moving Averages 22 2.6. New Features of Census X-12 66 References 66 References 99 Chapter 4 The Basic ARIMA Model 4.1. Introduction to ARIMA 101 4.2. Graphical Analysis of Time Series Data 102 4.2.1. Time Sequence Graphs 102 4.2.2. Correlograms and Stationarity 106 4.3. Basic Formulation of the Autoregressive Integrated Moving Average Model 108 4.4. The Sample Autocorrelation Function 110 4.5. The Standard Error of the ACF 4.6. The Bounds of Stationarity and Invertibility 4.7. The Sample Partial Autocorrelation Function 4.7.1. Standard Error of the PACF 4.8. Bounds of Stationarity and Invertibility Reviewed 4.9. Other Sample Autocorrelation Functions 4.10. Tentative Identification of Characteristic Patterns of Integrated, Autoregressive, Moving Average, and ARMA Processes
In this work, we investigate using Fourier coefficients (FCs) for capturing useful information ab... more In this work, we investigate using Fourier coefficients (FCs) for capturing useful information about viral sequences in a computationally efficient and compact manner. Specifically, we extract geographic submission location from SARS-CoV-2 sequence headers submitted to the GISAID Initiative, calculate corresponding FCs, and use the FCs to classify these sequences according to geographic location. We show that the FCs serve as useful numerical summaries for sequences that allow manipulation, identification, and differentiation via classical mathematical and statistical methods that are not readily applicable for character strings. Further, we argue that subsets of the FCs may be usable for the same purposes, which results in a reduction in storage requirements. We conclude by offering extensions of the research and potential future directions for subsequent analyses, such as the use of other series transforms for discreetly indexed signals such as genomes.
2021 IEEE Biomedical Circuits and Systems Conference (BioCAS), Oct 7, 2021
The comparison of genomic sequences is a important undertaking, for example in phylogenetic and d... more The comparison of genomic sequences is a important undertaking, for example in phylogenetic and differential sequence analyses. In this work we describe three filter designs that can be applied to genomic power spectra (PS) of any lengths to reduce their size while maintaining the relative distances which they provide and are relevant for data reduction, sorting, and correlation studies of an ensemble of sequences. Specifically we present: Minimal Variance Filtering (MVF), where the subsets of coefficients with the highest variance across a sample are selected, Automated Filter Learning (AFL), where a set of linear combinational filters are learned automatically by a 1- D deep convolutional neural network attempting to classify sequences on region of origin, and Maximal Variance Principal Components Filters (MVPCF) that provide a set of filters in the Principal component loadings determined among the highest variance elements of the PS for a sample. We provide a comparison of these approaches by examining their conservation of distances produced by the entire PS, and conclude with remarks about the benefits and drawbacks of each method while providing future avenues of pursuit for this research.
This paper introduces flowthrough centrality, a node centrality measure determined from the hiera... more This paper introduces flowthrough centrality, a node centrality measure determined from the hierarchical maximum concurrent flow problem (HMCFP). Based upon the extent to which a node is acting as a hub within a network, this centrality measure is defined to be the fraction of the flow passing through the node to the total flow capacity of the node. Flowthrough centrality is compared to the commonly-used centralities of closeness centrality, betweenness centrality, and flow betweenness centrality, as well as to stable betweenness centrality to measure the stability (i.e., accuracy) of the centralities when knowledge of the network topology is incomplete or in transition. Perturbations do not alter the flowthrough centrality values of nodes that are based upon flow as much as they do other types of centrality values that are based upon geodesics. The flowthrough centrality measure overcomes the problem of overstating or understating the roles that significant actors play in social networks. The flowthrough centrality is canonical in that it is determined from a natural, realized flow universally applicable to all networks.
Gene expression microarrays are widely used to study genome-wide gene expression profiles. Many m... more Gene expression microarrays are widely used to study genome-wide gene expression profiles. Many microarray analysis methods are available, making it challenging to decide which method to use. While the effectiveness of some of these methods has been assessed using artificial spike-in data sets, analytical approaches that work well with spike-in data may not work as well with data from real biological samples. To evaluate these methods we applied Gene Ontology (GO) term co-clustering as a comparative tool to evaluate 300 different data processing pipelines composed of various background correction, normalization and summarization methods using real biological data, based on the premise that an improvement in any step of microarray data analysis should be reflected in improved co-clustering of related genes. Our results suggest that background correction has little affect on GO term co-clustering characteristics, normalization has a big impact, some summarization methods constantly ou...
Sports such as diving, gymnastics, and ice skating rely on expert judges to score performance acc... more Sports such as diving, gymnastics, and ice skating rely on expert judges to score performance accurately. Human error and bias can affect the scores, sometimes leading to controversy, especially at high levels. Instant replay or recorded video can be used to assess judges’ scores, or sometimes update judges’ scores, during a competition. For diving in particular, judges are trained to look for certain characteristics of a dive, such as angle of entry, height of splash, and distance of the dive from the end of the board, to score each dive on a scale of 0 to 10, where a 0 is a failed dive and a 10 is a perfect dive. In an effort to obtain objective comparisons for judges’ scores, a diving meet was filmed and the video footage used to measure certain characteristics of each dive for each participant. The variables measured from the video were height of the dive at its apex, angle of entry into the water, and distance of the dive from the end of the board. The measured items were then ...
2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
Forecasting from multivariate time series data is a difficult task, made more so when the number of series (p) is much larger than the length of each series (T), which makes dimension reduction desirable prior to obtaining a model. The LASSO has become a widely used method to choose relevant covariates out of many candidates, and it has many variations and extensions, such as grouped LASSO, adaptive LASSO, weighted lag adaptive LASSO, and fused LASSO. Of these, only the weighted lag adaptive LASSO and the fused LASSO take into account natural ordering among series. To examine the ability of variations on the LASSO to choose relevant covariates for short time series, we use simulations for series with fewer than 50 observations. We then apply the methods to a data set on significant changes in self-reported psycho-social symptoms in the 30 years after the Chornobyl nuclear catastrophe.
the School of Nursing, Hunter College, City University of New York and health patterning therapist, New York, NY. Carole Birdsall, RN, EdD, is professor at the School of Nursing, Hunter College, City University of New York. Monnie McGee, PhD, is assistant professor, Dept. of Mathematics and Statistics, Southern Methodist University. Kim P. Baron, PhD, is a clinical psychologist. Stephen Lowenstein, MS, RRT, is director, Pulmonary Laboratory and ...
Results • There was little evidence of statistical change in this feasibility study; yet, valuable lessons were learned. Paired t-tests indicate there was a significant difference in the total power scores in the imagery group, and in the expected direction (two-tailed, t-statistic = -2.3, P = 0.035) and the choices sub-scale (two-tailed, t-statistic = -2.93, P = 0.01) of the power instrument from weeks one to 16 of the study. Eight of 17 (47%) participants in the MI group reduced or discontinued their medications. Three of 16 (19%) particip...
Motivation: Affymetrix GeneChip arrays require summarization in order to combine the probe-level intensities into one value representing the expression level of a gene. However, probe intensity measurements are expected to be affected by different levels of non-specific- and cross-hybridization to non-specific transcripts. Here, we present a new summarization technique, the Distribution Free Weighted method (DFW), which uses information about the variability in probe behavior to estimate the extent of non-specific and cross-hybridization for each probe. The contribution of the probe is weighted accordingly during summarization, without making any distributional assumptions for the probe-level data. Results: We compare DFW with several popular summarization methods on spike-in datasets, via both our own calculations and the ‘Affycomp II’ competition. The results show that DFW outperforms other methods when sensitivity and specificity are considered simultaneously. With the Affycomp spike-...
In this paper, we utilize next-generation sequencing (NGS) data from the LungMap project [1] to identify and characterize the developmental RNA transcriptome in alveolar epithelial type II (AT2) cells of embryonic mouse lungs at gestational ages of embryonic days 16 (E16) and 18 (E18). Late gestation lung cellular maturation is necessary for survival at birth [4]. Using R and the Bioconductor packages for RNA-seq analysis, we analyze changes in the mouse lung AT2 cell RNA transcriptome as this maturation process takes place. In particular, we identify the cluster of genes whose expression changes markedly between immature (E16) and mature (E18) lungs, which can be used to define cell pathways that appear critical for the maturation process. Our results show that there are 98 differentially expressed genes, with 82 genes where differences in counts cannot be attributed to difference in sample origin. We were surprised to identify substantial differences in RNA expression between two experien...
Papers by Monnie McGee