Questions tagged [sequence-analysis]

Analysis of a DNA, RNA, or peptide sequence to understand its features, function, structure, or evolution.

0 votes
0 answers
39 views

Is my time course analysis with DESeq2 valid?

As a pure behavioural ecologist who has stumbled into the world of gene expression analysis and is a novice at analyzing it, I am asking for help in validating whether my model is correct for the type ...
Jason Rissanen
0 votes
1 answer
39 views

How to approximate the point a sequence is converging to?

I have created a poker solver as part of my Master's Thesis. This solver uses Counterfactual Regret Minimization (CFR) to compute a Nash Equilibrium of Hold'em or Omaha Poker. The solver uses existing ...
Timon Groen
0 votes
0 answers
23 views

Random sequence generator algorithm: non-informative prior distribution

I want to conduct a Bayesian statistical analysis of a sequence generation phenomenon. The sequences generated contain elements from a known alphabet. Working on that, I have tried to define the prior ...
Guilhem Nespoulous
0 votes
1 answer
55 views

Optimization of fault diagnosis sequence using probability and cost [closed]

I am developing an algorithm to optimize the fault diagnosis sequence using probability and cost. For example, I have 3 possible diagnosis actions: option 1: probability which can detect the root ...
stat_man
0 votes
1 answer
97 views

How to look at between-group differences for a single gene using RNA-seq data

I have an RNA-seq dataset, but I am only interested in the expression of a single pre-specified gene and in comparing it between 2 groups (patient phenotypes). Some have suggested (without a reference) ...
Kristoffer N
0 votes
1 answer
112 views

Why use sliding window input features in sequence modeling?

I was reading through the DNABERT paper and found that their input features were k-mers. This is equivalent to using rolling/sliding window features in the other common family of sequential problems, ...
Avatrin
  • 102
1 vote
1 answer
46 views

Statistical significance in known population

I am working with a data set with the sequence identity (a value in [0,1] representing the conservation between sequences) of many genes for many bacterial strains. I would like to be able to draw ...
Rachel
  • 13
1 vote
1 answer
76 views

How to test differences (over time and between treatments) of a specific species in DNA metabarcoding sequencing data?

I have DNA metabarcoding sequencing data in the following format:

plot  Time_point  reads_species_A  Reads_species_B  reads_species_C
1     T1          0                245              65
2     T1          48               455              0
3     T1          15               5                10
1     T3          153              23               564
2     T3          ...

RobH
  • 113
1 vote
2 answers
437 views

Weird Cook's distance results using DESeq2

I'm currently trying to assess fold change when comparing two different sample types using the DESeq2 package, and I'm getting weird Cook's distance values which are causing major problems. The two ...
Miguel
  • 11
1 vote
1 answer
143 views

Multichannel distance from a reference sequence

I am working with a large dataset and applying multichannel sequence analysis to two life course domains. I would like to adapt a solution suggested in the post below to multichannel sequences, but I ...
Léa Pessin
0 votes
1 answer
1k views

A method for clustering 1D signals?

I have samples from 150 different genes containing the following information: the sequence of the gene, and the signal strength along the length of the gene (the signal can be negative or positive). I have ...
Ender
  • 5
2 votes
0 answers
417 views

How to calculate the evaluation metrics on streaming data for online ML algorithms

I am working on a binary classification problem where I need to develop an online ML model that can work on streaming data. However, I am not sure how I can use the evaluation metrics for ...
Amhs_11
  • 333
1 vote
1 answer
145 views

Viewing automated cost matrix for DHD in TraMineR

I'm using social sequence analysis, and comparing between different distance methods for my data. I'm wondering if there is a way to view/call the automatic substitution cost matrix that the dynamic ...
Siobhan
  • 13
0 votes
1 answer
31 views

Determining a p-value for a test statistic that depends on other test statistics

Sorry for the confusing wording of the title. If someone has a better way to word it, please feel free to change it. Background: For those unfamiliar with bioinformatics data, I have data from a ...
The_Questioner
0 votes
0 answers
775 views

What does it mean if a simple linear neural network performs better than an LSTM on sequential data?

I'm working on a genetic data project, where one data sample is represented as sequences of integers (of length 2000) and it needs to be classified into one of 4 classes, so I guess it is similar to ...
Ronny
  • 1
1 vote
1 answer
58 views

Why shouldn't you mix variable size inputs in the same minibatch?

I am trying to build a CNN-LSTM architecture in tf.keras that classifies sequences of varying sizes. My training data is highly variable and I would have to crop/pad sequences in order to create ...
Tom
  • 53
0 votes
0 answers
22 views

Choosing a model for input: categorised, weighted sequence, output: binary variable

What would be an appropriate model for predicting a binary target variable, given a weighted sequence? Sequences will be reasonably short, typically between ~ 1 and 5 elements. I have in the order of ...
Ian
  • 101
1 vote
0 answers
29 views

What are the classifiers that can be used for sequence data?

I've been going through classifiers like Naive Bayes, Decision Tree, etc. I have sequence data like so ...
vbnr
  • 111
0 votes
0 answers
402 views

Training and testing transformer model from scratch

As you know, transformers are one of the strongest models in the field of NLP and machine translation. I know there are many resources, but I still could not find a good tutorial teaching how to use ...
Kadaj13
  • 395
2 votes
1 answer
618 views

Sequence comparison metrics

I know about edit distance, longest common subsequence, and their normalized versions for measuring the similarity between sequences. But are there any similarity measures other than these?
Shivanisrivarshini
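
For readers unfamiliar with the two measures named in the excerpt above, here is a minimal Python sketch of edit (Levenshtein) distance and longest-common-subsequence length. The function names are illustrative, and the normalization shown (dividing by the longer sequence's length) is an assumption: it is one common convention among several, not necessarily the one the asker had in mind.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def normalized_edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest
```

Both functions accept any indexable sequences (strings, lists of states), so they apply equally to DNA strings and to categorical event sequences.
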
0 votes
0 answers
27 views

Predicting the Winner of a sequence of numbers

I have various series of numbers of different lengths (ranging from 4 to 10) such as the example below: [1.5, 5.0, 6.0, 6.0, 8.0] [1.4, 6.0, 7.5, 9.0, 50.0, 100.0, 200.0] For each one of them I ...
Peterlytics
1 vote
0 answers
84 views

Analyse a set of sequences of varying length with PCA?

Task description: I have a dataset with strings indicating the sequence of the screens a user visits when making a purchase on an app. A string could be: "1,2,1,2,3,3,4,5,6,7,3,2,5,6". Another string ...
Mikkel Miqlliot Lehmann
0 votes
1 answer
181 views

Binary Sequence Prediction Model with Time-dependent features

I have a very long sequence of binary items (0 or 1). Each item is associated with a timestamp. For example: ...
hans glick
1 vote
0 answers
98 views

Likelihood Matrix from a Random Forest?

I'm going through the supporting material of a paper (https://science.sciencemag.org/content/360/6384/81), trying to reproduce their results (see below; note that HVG = highly variable gene). The data ...
Jo Fisher
1 vote
0 answers
39 views

Ideas for determining the optimal sequence of calls and emails to maximize the probability of a sales lead converting to a sale?

I have a large data set of sales leads that are in the form of a lead_id, a sequence of binary integers that denote the order of emails and phone calls made to a sales lead, and the binary outcome of ...
statsquestions
1 vote
1 answer
99 views

Have many error likelihoods; how to combine them to get a confidence or p-value?

I'm working in bioinformatics and it's been a long time since I dusted off my statistics. Basically I'm working on variant calling, which amounts to sequencing a large number of sequence reads and ...
lonestar21
3 votes
0 answers
300 views

Which distance metric to use to cluster categorical sequences (clickstreams or clickpaths)?

For my research, I want to cluster website visitors based on their clickstreams to understand different information behavior patterns (i.e., customer/visitor journeys). The data can be characterized ...
MLud
  • 51
0 votes
1 answer
53 views

ATGC sequence of gene expression data [closed]

I am not a pro in genetics so please excuse my non-technical language. I need a dataset which contains the gene expression as well as the associated ATGC sequence with each gene expression value. For ...
Statistical_Research
1 vote
1 answer
575 views

Handling missing data in Sequence Analysis (TraMineR) within the observation window

I'm using sequence analysis. I have a question about how to deal with missing data within the observation window. The starting point of the analysis is when respondents leave secondary school (t0). I ...
Robin
  • 11
4 votes
1 answer
431 views

Unsupervised clustering of sequence of events to subsequences

I have a big dataset of M sequences of [1 - N] events, where each event has multiple properties (start date, end date, location, ...
Dimgold
  • 318
3 votes
2 answers
115 views

Localized distance function on sequential binary data

I am trying to find a good distance function for sequential data that is all binary. For now, I am using edit distance; however, I have some more domain-specific knowledge that I would like to ...
Maximal
  • 213
2 votes
2 answers
931 views

Sequence prediction based on non-sequential inputs

I have a dataset with timestamps and event values (true or false -- these are based on sensor data which detect room occupancy). I'd like to build a model that would take a timestamp as an input and ...
de1pher
  • 163
1 vote
0 answers
168 views

Probability of Finding Two Matching Subsequences in a Sequence

I'm currently studying DNA sequencing and am trying to find a formula which gives the probability that a subsequence of length $k$ appears twice in a sequence of $L$ bases (characters); this is pretty ...
BodneyC
  • 11
0 votes
1 answer
36 views

Estimation of partitioning error in next-generation sequencing experiments

[Edited: explanation of the partitioning error] I would like to estimate how the initial number of molecules (or the level of gene expression) affects reproducibility between technical replicates of ...
hibernicah
7 votes
0 answers
805 views

Traditional state-space models and LSTMs

I am trying to understand the nature of LSTMs in relation to intuitions from traditional state-space models (e.g., Kalman filtering). The code below aims to simulate a simple univariate linear state-...
user46098
2 votes
0 answers
42 views

Selection of differentially expressed genes

I don't have any statistical background and have some questions. I see that some research papers select differentially expressed genes based on fold change and p-value. And in some other papers I ...
beginner
  • 175
2 votes
1 answer
145 views

Calculating p-values for ratios of binomial variables

I have a problem that I will express in 2 ways: a math-y way and a biology way. Hopefully this will make it more clear. Math-y way: I have N observations of a pair of binomial variables, call them ...
Nathan Crook
4 votes
2 answers
2k views

Clustering customers by their orders sequence patterns

I have a dataset with clients' orders. Example: ...
Andrey
  • 61
2 votes
1 answer
91 views

Sequence prediction: ambiguity in training set

There is a set of sequences (train set), where each element is one or multiple tags: A, B -- A -- Z -- Z, A B -- A -- Z -- D ... Given a new sequence: ...
Denis Kulagin
3 votes
1 answer
417 views

TraMineR: predicting class membership of new sequences

Question summary: This question pertains to analyzing dissimilarities between discrete state sequences (e.g., using TraMineR), more specifically to classifying new, ...
Maxim.K
  • 560
3 votes
1 answer
268 views

Occurrence of at least 1 HT and HH in sequences of 4 coin flips not equally likely

I was reading this interesting article on hot hands and streaks in sports. The article revolves around the 16 possible sequences of 4 coin flips (H = heads, T = tails): ...
beta
  • 253
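
The claim in the title of the entry above can be checked by brute force over the 16 sequences. The following is a minimal sketch, not taken from the question or the article it cites; the helper name `count_containing` is hypothetical.

```python
from itertools import product


def count_containing(pattern, n=4):
    """Count length-n H/T sequences that contain `pattern` at least once."""
    return sum(pattern in "".join(seq) for seq in product("HT", repeat=n))


total = 2 ** 4                  # 16 possible sequences of 4 flips
ht = count_containing("HT")     # sequences with at least one HT
hh = count_containing("HH")     # sequences with at least one HH
print(f"HT: {ht}/{total}, HH: {hh}/{total}")
```

The counts differ because the only sequences avoiding HT are a run of tails followed by a run of heads (5 of the 16), whereas 8 of the 16 avoid HH.
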
0 votes
1 answer
1k views

Recurrent Neural Network model with more than one input

I know RNNs (with LSTMs or GRUs) are now one of the most promising options for modelling sequential data, when ordering of the data matters. However, sometimes there are also some categorical ...
PDRX
  • 103
2 votes
1 answer
73 views

Testing pdist() for statistical significance

Using pdist() in the PST package, two probabilistic suffix trees (PSTs) can be compared to each other. The function will output ...
histelheim
  • 3,063
2 votes
0 answers
458 views

Building RNNs on mixed sequential and non-sequential data

I have a data set that is a bunch of windows; for each window I want to perform regression. The windows themselves have four sequential features that span back about 24 time steps. Additionally each ...
Brock
  • 21
7 votes
1 answer
23k views

Optimum number of epochs and neurons for an LSTM network

I wanted to know if there's a way to select an optimum number of epochs and neurons to forecast a certain time series using LSTM, the motive being automation of the forecasting problem, i.e. the ...
Ankush Raut
5 votes
2 answers
10k views

Sliding window for time series modelling

I am modelling a univariate time series in the form shown. Suppose the time interval in the series is daily, namely each y was collected every day. I want to use the sliding window method to ...
LUSAQX
  • 463
5 votes
1 answer
572 views

Predicting the observations in a POMDP with a recurrent neural network

I use neural networks for online sequence prediction. The performance of LSTM in this case, however, is not nearly as good as I expected. Maybe someone can help me understand where the problem lies. ...
wehnsdaefflae
0 votes
0 answers
105 views

Is there any sequence dissimilarity measure that has an intuitive interpretation like the dissimilarity index has?

I want to know how much the sequences in my sample differ from a given ideal-typical sequence. Is there any intuitive way of interpreting the dissimilarity measures for sequences? If it would be the ...
Kenji
  • 858
0 votes
1 answer
101 views

Is using "Normal Approximation to binomial distribution" to test mutation enrichment in genomic region correct?

We are analyzing cancer patient mutation data. We defined a set of regions on the human genome as binding events (for those interested in the subject, it is a transcription factor binding ...
MorTunco
1 vote
1 answer
159 views

Can I use chi-squared test of independence to compare base compositions of human genome?

I need to prove that a given region on the genome has the same base composition proportions (20% A, 15% T, 19% C, etc…) as that genome. I thought about doing a chi-squared test of independence ...
MorTunco