Questions tagged [sequence-analysis]
Analysis of a DNA, RNA, or peptide sequence to understand its features, function, structure, or evolution.
123 questions
0
votes
0
answers
39
views
Is my time course analysis with DESeq2 valid?
As a pure behavioural ecologist who has stumbled into the world of gene expression analysis and is a novice at it, I am asking for help in validating whether my model is correct for the type ...
0
votes
1
answer
39
views
How to approximate the point a sequence is converging to?
I have created a poker solver as part of my Master's Thesis. This solver uses Counterfactual Regret Minimization (CFR) to compute a Nash Equilibrium of Hold'em or Omaha Poker. The solver uses existing ...
0
votes
0
answers
23
views
Random sequence generator algorithm: non-informative prior distribution
I want to conduct a Bayesian statistical analysis of a sequence generation phenomenon.
The sequences generated contain elements from a known alphabet.
Working on that, I have tried to define the prior ...
0
votes
1
answer
55
views
Optimization of fault diagnosis sequence using probability and cost [closed]
I am developing an algorithm to optimize the fault diagnosis sequence using probability and cost.
For example, I have 3 possible diagnosis actions:
option 1: probability which can detect the root ...
0
votes
1
answer
97
views
How to look at between-group differences for a single gene using RNA seq data
I have an RNA-seq dataset, but I am only interested in the expression of a single pre-specified gene and in comparing it between 2 groups (patient phenotypes). Some have suggested (without a reference) ...
0
votes
1
answer
112
views
Why use sliding window input features in sequence modeling?
I was reading through the DNABERT paper and found that their input features were k-mers. This is equivalent to using rolling/sliding window features in the other common family of sequential problems, ...
1
vote
1
answer
46
views
Statistical significance in known population
I am working with a data set with the sequence identity (a value in [0,1] representing the conservation between sequences) of many genes for many bacterial strains. I would like to be able to draw ...
1
vote
1
answer
76
views
How to test differences (over time and between treatments) of a specific species in DNA metabarcoding sequencing data?
I have DNA metabarcoding sequencing data in the following format:
plot  Time_point  reads_species_A  Reads_species_B  reads_species_C
1     T1          0                245              65
2     T1          48               455              0
3     T1          15               5                10
1     T3          153              23               564
2     T3          ...
1
vote
2
answers
437
views
Weird Cook's distance results using DESeq2
I'm currently trying to assess fold change when comparing two different sample types using the DESeq2 package and I'm getting weird Cook's distance values which are causing major problems.
The two ...
1
vote
1
answer
143
views
Multichannel distance from a reference sequence
I am working with a large dataset and applying multichannel sequence analysis to two life course domains. I would like to adapt a solution suggested in the post below to multichannel sequences, but I ...
0
votes
1
answer
1k
views
A method for clustering 1D signals?
I have samples from 150 different genes containing the following information:
sequence of the gene
signal strength along the length of the gene (the signal can be negative or positive).
I have ...
2
votes
0
answers
417
views
How to calculate the evaluation metrics on streaming data for online ML algorithms
I am working on a binary classification problem where I need to develop an online ML model that can work on streaming data. However, I am not sure how I can use the evaluation metrics for ...
1
vote
1
answer
145
views
Viewing automated cost matrix for DHD in TraMineR
I'm using social sequence analysis, and comparing between different distance methods for my data. I'm wondering if there is a way to view/call the automatic substitution cost matrix that the dynamic ...
0
votes
1
answer
31
views
Determining a p-value for a test statistic that depends on other test statistics
Sorry for the confusing wording of the title. If someone has a better way to word it, please feel free to change it.
Background
For those unfamiliar with bioinformatics data, I have data from a ...
0
votes
0
answers
775
views
What does it mean if a simple linear neural network performs better than an LSTM on sequential data?
I'm working on a genetic data project, where each data sample is represented as a sequence of integers (of length 2000) and needs to be classified into one of 4 classes, so I guess it is similar to ...
1
vote
1
answer
58
views
Why shouldn't you mix variable size inputs in the same minibatch?
I am trying to build a CNN-LSTM architecture in tf.keras that classifies sequences of varying sizes. My training data is highly variable and I would have to crop/pad sequences in order to create ...
0
votes
0
answers
22
views
Choosing a model for input: categorised, weighted sequence, output: binary variable
What would be an appropriate model for predicting a binary target variable, given a weighted sequence?
Sequences will be reasonably short, typically between ~ 1 and 5 elements. I have in the order of ...
1
vote
0
answers
29
views
What are the classifiers that can be used for sequence data?
I've been going through classifiers like Naive Bayes, Decision Tree, etc. I have sequence data like so
...
0
votes
0
answers
402
views
Training and testing transformer model from scratch
As you know, transformers are one of the strongest models in the field of NLP and machine translation.
I know there are many resources, but I still could not find a good tutorial teaching how to use ...
2
votes
1
answer
618
views
Sequence comparison metrics
I know about edit distance, longest common subsequence, and their normalized versions to measure the similarity between sequences. But do we have any similarity measures other than the ones above?
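For readers unfamiliar with the two measures named in this question, here is a minimal sketch of both, not taken from the question or its answers. The function names are made up for illustration, and dividing by the longer sequence's length is only one of several normalization conventions (an assumption here).

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_edit_similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer length is an assumed convention."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Both functions accept any indexable sequences (strings, lists of states), so they apply equally to text and to categorical event sequences.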
0
votes
0
answers
27
views
Predicting the Winner of a sequence of numbers
I have various series of numbers of different lengths (ranging from 4 to 10) such as the example below:
[1.5, 5.0, 6.0, 6.0, 8.0]
[1.4, 6.0, 7.5, 9.0, 50.0, 100.0, 200.0]
For each one of them I ...
1
vote
0
answers
84
views
Analyse a set of sequences of varying length with PCA?
Task description
I have a dataset with strings indicating the sequence of the screens a user visits when making a purchase on an app.
A string could be: "1,2,1,2,3,3,4,5,6,7,3,2,5,6". Another string ...
0
votes
1
answer
181
views
Binary Sequence Prediction Model with Time-dependent features
I have a very long sequence of binary items (0 or 1). Each item is associated with a timestamp. For example:
...
1
vote
0
answers
98
views
Likelihood Matrix from a Random Forest?
I'm going through the supporting material of a paper (https://science.sciencemag.org/content/360/6384/81), trying to reproduce their results (see below, note that HVG=highly variable gene). The data ...
1
vote
0
answers
39
views
Ideas for determining the optimal sequence of calls and emails to maximize the probability of a sales lead converting to a sale?
I have a large data set of sales leads that are in the form of a lead_id, a sequence of binary integers that denote the order of emails and phone calls made to a sales lead, and the binary outcome of ...
1
vote
1
answer
99
views
I have many error likelihoods; how do I combine them to get a confidence or p-value?
I'm working in bioinformatics and it's been a long time since I dusted off my statistics. Basically I'm working on variant calling which amounts to sequencing a large number of sequence reads and ...
3
votes
0
answers
300
views
Which distance metric to use to cluster categorical sequences (clickstreams or clickpaths)?
For my research, I want to cluster website visitors based on their clickstreams to understand different information behavior patterns (i.e., customer/visitor journeys). The data can be characterized ...
0
votes
1
answer
53
views
ATGC sequence of gene expression data [closed]
I am not a pro in genetics so please excuse my non-technical language.
I need a dataset which contains the gene expression as well as the associated ATGC sequence with each gene expression value. For ...
1
vote
1
answer
575
views
Handling missing data in Sequence Analysis (TraMineR) within the observation window
I'm using sequence analysis. I have a question about how to deal with missing data within the observation window.
The starting point of the analysis is when respondents leave secondary school (t0). I ...
4
votes
1
answer
431
views
Unsupervised clustering of sequence of events to subsequences
I have a big dataset of M sequences of [1 - N] events, where each event has multiple properties (start date, end date, location, ...
3
votes
2
answers
115
views
Localized distance function on sequential binary data
I am trying to find a good distance function for sequential data that is all binary. For now, I am using edit distance; however, I have some more domain-specific knowledge that I would like to ...
2
votes
2
answers
931
views
Sequence prediction based on non-sequential inputs
I have a dataset with timestamps and event values (true or false -- these are based on sensor data which detect room occupancy). I'd like to build a model that would take a timestamp as an input and ...
1
vote
0
answers
168
views
Probability of Finding Two Matching Subsequences in a Sequence
I'm currently studying DNA sequencing and am trying to find a formula which gives the probability that a subsequence of length $k$ appears twice in a sequence of $L$ bases (characters); this is pretty ...
0
votes
1
answer
36
views
Estimation of partitioning error in next-generation sequencing experiments
[Edited: explanation of the partitioning error]
I would like to estimate how the initial number of molecules (or the level of gene expression) affects reproducibility between technical replicates of ...
7
votes
0
answers
805
views
Traditional state-space models and LSTMs
I am trying to understand the nature of LSTMs in relation to intuitions from traditional state-space models (e.g., Kalman filtering). The code below aims to simulate a simple univariate linear state-...
2
votes
0
answers
42
views
Selection of differentially expressed genes
I don't have any statistical background, and I have some questions.
I see in some research papers they select differentially expressed genes based on fold change and p-value. And in some other papers I ...
2
votes
1
answer
145
views
Calculating p-values for ratios of binomial variables
I have a problem that I will express in 2 ways: a math-y way and a biology way. Hopefully this will make it more clear.
Math-y way:
I have N observations of a pair of binomial variables, call them ...
4
votes
2
answers
2k
views
Clustering customers by their order sequence patterns
I have a dataset with clients' orders. Example:
...
2
votes
1
answer
91
views
Sequence prediction: ambiguity in training set
There is a set of sequences (train set), where each element is one or multiple tags:
A, B -- A -- Z -- Z, A
B -- A -- Z -- D
...
Given a new sequence:
...
3
votes
1
answer
417
views
TraMineR: predicting class membership of new sequences
Question summary
This question pertains to analyzing dissimilarities between discrete state sequences (e.g., using TraMineR), more specifically to classifying new, ...
3
votes
1
answer
268
views
Occurrence of at least 1 HT and HH in sequences of 4 coin flips not equally likely
I was reading this interesting article on hot hands and streaks in sports. The article revolves around the 16 possible sequences of 4 coin flips (H = heads, T = tails):
...
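The claim in the title can be checked by brute force. The sketch below is not from the question or the article it cites; it simply enumerates the 16 sequences and counts how many contain each pattern at least once (variable names are illustrative).

```python
from itertools import product

# All 2**4 = 16 equally likely sequences of four fair coin flips.
sequences = ["".join(flips) for flips in product("HT", repeat=4)]

# Sequences containing each pattern at least once.
with_ht = [s for s in sequences if "HT" in s]
with_hh = [s for s in sequences if "HH" in s]

print(len(sequences), len(with_ht), len(with_hh))  # prints: 16 11 8
```

HT is missing only from sequences where every T precedes every H (5 of them), while HH is missing from every sequence with no two adjacent heads (8 of them), which is why the two counts differ.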
0
votes
1
answer
1k
views
Recurrent Neural Network model with more than one input
I know RNNs (with LSTMs or GRUs) are now one of the most promising options for modelling sequential data, when ordering of the data matters. However, sometimes there are also some categorical ...
2
votes
1
answer
73
views
Testing pdist() for statistical significance
Using pdist() in the PST package, two probabilistic suffix trees (PSTs) can be compared to each other. The function will output ...
2
votes
0
answers
458
views
Building RNNs on mixed sequential and non-sequential data
I have a data set that is a bunch of windows; for each window I want to perform regression. The windows themselves have four sequential features that span back about 24 time steps. Additionally, each ...
7
votes
1
answer
23k
views
Optimum number of epochs and neurons for an LSTM network
I wanted to know if there's a way to select an optimum number of epochs and neurons to forecast a certain time series using LSTM, the motive being automation of the forecasting problem, i.e. the ...
5
votes
2
answers
10k
views
Sliding window for time series modelling
I am modelling a univariate time series in the form shown. Suppose the time interval in the series is daily, i.e., each y was collected once per day.
I want to use the sliding window method to ...
5
votes
1
answer
572
views
Predicting the observations in a POMDP with a recurrent neural network
I use neural networks for online sequence prediction. The performance of LSTM in this case, however, is not nearly as good as I expected. Maybe someone can help me understand where the problem lies.
...
0
votes
0
answers
105
views
Is there any sequence dissimilarity measure that has an intuitive interpretation like the dissimilarity index has?
I want to know how much the sequences in my sample differ from a given ideal-typical sequence. Is there any intuitive way of interpreting the dissimilarity measures for sequences?
If it would be the ...
0
votes
1
answer
101
views
Is using "Normal Approximation to binomial distribution" to test mutation enrichment in a genomic region correct?
We are analyzing cancer patient mutation data. We defined a set of regions on the human genome as binding events (for those interested in the subject, it is a transcription factor binding ...
1
vote
1
answer
159
views
Can I use a chi-squared test of independence to compare base compositions of the human genome?
I need to prove that a given region on the genome has the same base composition proportions (20% A, 15% T, 19% C, etc…) as that genome. I thought about doing a chi-squared test of independence ...