Questions tagged [data-leakage]
76 questions
0
votes
0
answers
11
views
Use training or testing set when calculating sample weights for evaluation?
I have an ML model built from data that is not representative of the population class frequencies. The majority class is actually undersampled, and so it is more frequent in the population ...
1
vote
1
answer
40
views
How does temporal data leakage happen?
Assume I use a moving window to slice a history of daily stock closing prices, using the past 7 days to predict the next day. For each training instance, I'm strictly using historical data to predict future ...
0
votes
0
answers
29
views
Scenario where PCA on X(train) and X(test) is not data leakage
There are countless (re)posts discussing the question of using principal component analysis (PCA) as pre-processing method for the features of a regression/classification problem within a cross-...
0
votes
0
answers
30
views
Do features derived from an exploratory data analysis lead to inaccurate K-fold cross validation?
I'm a little confused about the interplay of exploratory data analysis, feature engineering and feature selection using k-fold cross validation. I would much appreciate it if someone could give me ...
3
votes
1
answer
69
views
Best Practices for Splitting Data in a Repeated Measures Classification Problem
I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application scenario, once the target becomes ...
0
votes
0
answers
36
views
Avoiding Information Leakage in Backtesting with CPCV-Tuned Hyperparameters
I'm using Combinatorial Purged Cross-Validation to tune hyperparameters for a binary classification model applied in a month-end trading strategy. I have 6 months of data and used CPCV with 15 splits ...
1
vote
0
answers
37
views
How should you split up data in a train-test-validation split?
I've seen it is generally recommended, when using a train-test-validation data split, to first split your data into train and test datasets, and then further split the train dataset into a train and ...
0
votes
0
answers
74
views
Does replacing binned variables with Weight of Evidence values introduce data leakage?
In my company I've been noticing some binary classification modeling code that replaces bins of a continuous variable with the corresponding Weight of Evidence (WoE) of the given bin. As far as I ...
0
votes
0
answers
24
views
Scaling data to a sample that is neither training nor validation. Is this data leakage?
TL;DR: Data being scaled to a sample that is neither training nor validation. Is this data leakage?
Hi,
I have a data set from samples that are distributed on a plate. Precisely, there are 96 wells ...
0
votes
0
answers
70
views
Data leakage in time series forecasting framed as a supervised learning problem
Suppose that I have a simple univariate time series. My goal is to use the value of 3 consecutive days to predict the value of the fourth day.
I built my dataset by applying a rolling window that ...
2
votes
1
answer
194
views
EDA and Model Selection for Forecasting while avoiding Data Leakage
How to do EDA and model selection for time series forecasting without data leakage?
I'm assuming that just checking for missing values is OK. But is graphing the entire time series considered data leakage?
...
1
vote
0
answers
60
views
Should I delete samples from the training data that are present in the testing data by accident?
I classify pairs of entities, let's say dog-cat pairs, whether there is association between them (positive class) or there is not (negative class). I have a moderately sized positive dataset (~130k ...
2
votes
0
answers
37
views
Data leakage: Train test split before or after data preprocessing? [duplicate]
A while ago I came across the word "data leakage" for the first time, and after some research, I found that it is a common mistake among data science/machine learning practitioners. But the ...
0
votes
0
answers
41
views
Nested Cross Validation performance on each Sequential Feature Selection subset
I want to get cross-validated performance values of my model after hyperparameter tuning and sequential feature selection on each feature subset. Following this example, I want to use an outer-CV ...
0
votes
0
answers
26
views
Temporal leakage or different phenomena?
I have the following problem/toy example:
every week I sample data describing users (one row is one user)
I want to predict whether a user will be a fraud (1) or not (0) in the next three weeks, so basically binary ...
2
votes
1
answer
72
views
Data leakage or not?
The goal is to predict whether an employee will leave the company: yes or no. I have a dataframe with information about employees. There are 30 independent features and one dependent feature (Left: ...
0
votes
0
answers
41
views
Simple demonstration of imputation data leakage?
I'm aware that it's best practice to do all pre-processing within train-test splits, including data imputation. At least, it's recommended not to use the test data to generate the imputation model for ...
1
vote
0
answers
20
views
Data augmentation specific per class
I have a database of defects on plastic films. Defects are burns, holes, and similar things. Some defects are direction specific. A vertical sign on the material represents a scratch done on the ...
2
votes
1
answer
498
views
Why doesn't CatBoost Encoding cause target leakage?
I'm currently working on a fraud detection problem with a dataset of 300,000 rows and 500 columns, 70 of which are categorical with over 10 categories each. I'm facing memory constraints and exploring ...
1
vote
1
answer
262
views
Do feature selection and model testing have to be coupled in each fold of the cross-validation?
Quick overview of my data and aims:
I have two groups, 50 samples per group, and 6000 features. I want to find the minimal number of features capable of distinguishing both groups. I know the sample ...
2
votes
3
answers
322
views
What is "information leak from test to train" ? Is stratification by target a leak?
It's common practice to do procedures such as standardization and even missing value imputation (commonly based on some means) after train/test split - otherwise it is treated as information leak from ...
0
votes
1
answer
28
views
Is using the same person as data observation in different time stamps a way to produce data leakage in a Machine Learning model?
Let's assume we are going to train a regression model (could be any ML tabular solution for regression, e.g. LGBM, XGBoost, Perceptron, ...) to predict a customer's profit in the next month. While ...
1
vote
1
answer
176
views
Preprocessing on training set only or both training & test set? Seems like there would be errors for both answers
Let's say I have a dataset that hasn't been split into train/test yet.
Upon loading it, I discover that there are columns where there are nulls that need to be filled in, some quadratic relationships ...
5
votes
1
answer
5k
views
Do we One Hot Encode (create Dummy Variables) before or after Train/Test Split?
I've seen quite a lot of conflicting views on if one-hot encoding (dummy variable creation) should be done before/after the training/test split.
Responses seem to state that one-hot encoding before ...
5
votes
0
answers
181
views
Examples of Leakages in the Training Data
I was wondering about Data Leakage in the data preparation phase during the training of a model. By definition, data leakage happens when information is revealed to the model giving it an unrealistic ...
3
votes
2
answers
488
views
Is data leakage from time series autocorrelation actual data leakage?
That's the question: Is data leakage from time series autocorrelation actual data leakage?
To explain it with an example (I will separate the example in numbers to give more structure to the ...
1
vote
1
answer
26
views
How to choose train, test and validation data for two-staged experiments?
I have a question concerning the possibility of the train-test-validation split in a staged experiment setup.
The data I used is split up into 3 parts: train, test and validation data.
Then I try to ...
1
vote
1
answer
262
views
Is data leakage a concern when using an ensemble of leave-one-out predictions?
I am new to stacking. I have a dataset with N samples and 7 tables corresponding to different data types, plus a binary label. Some tables have dozens of features, others have many thousands. I train ...
1
vote
1
answer
714
views
Is shuffling timeseries data, then separating into training/testing sets a form of data leakage?
I am building a multiple regression MLP (Multi-Layer Perceptron), the input is 8 weather variables collected from October-February, and the output is another weather variable.
The assumption is that ...
4
votes
1
answer
612
views
Lagged variables, data leakage and machine learning
I am reading a paper that fits a random forest (RF) to some data that is grouped by company and quarter. In the data engineering stage, the authors include 'lagged' variables of many of the ...
3
votes
1
answer
664
views
Train/test split on time-based data with lagged features
I am working with data on bank transactions, and am using RFM (recency/frequency/monetary value) features like days since last transaction, number of transactions last n days, average value of ...
3
votes
0
answers
131
views
Avoiding data leakage in preprocessing and handling unseen values in test data
I've been reading up on avoiding data leakage in the preprocessing step of a machine-learning/data-science pipeline, specifically that it is wrong to apply preprocessing to both training and test data ...
0
votes
1
answer
64
views
How to deal with data leakage in historical data
I have a dataset containing matches from 2000 to 2018, and I am asked to predict match outcomes for the year 2017. To avoid data leakage, I am going to train my model only on 2000 to 2016. In the ...
1
vote
0
answers
193
views
Cross validation within a bootstrap sample: is leakage a problem here?
I would like to calculate the sampling distribution for logistic LASSO coefficients. One approach to calculating this sampling distribution is described on page 143 of "Statistical Learning with ...
2
votes
0
answers
80
views
Quote on too good to be true model performance
I seem to recall that there is a nice quote by some (well known?) machine learning expert about too good to be true model performance. The quote is something like "If your model performance looks ...
1
vote
0
answers
585
views
Normalization and RidgeCV in Sklearn Pipeline - possible data leakage?
To avoid data leakage between the train and test set, I'm using sklearn's Pipeline as follows:
...
1
vote
0
answers
18
views
Preprocessing for the final model to be deployed [duplicate]
Typically for a ML workflow, we import the data (X and y), split the X and ...
1
vote
0
answers
247
views
Many Preprocessing steps can cause data leakage, then how should we perform EDA?
For the past week, I have been constantly checking with people on this sub on how to avoid data leakage during preprocessing like feature selection and/or scaling etc here and here.
I understand most ...
1
vote
1
answer
336
views
Data Leak or Feature Engineering in regression problem?
I recently worked on a housing price dataset, where the goal is to predict sale prices.
I had the idea to construct a feature on the training set, which would be dependent on the target variable and ...
0
votes
1
answer
126
views
Modeling length of stay with (Cox) regression with censored observations
I'm attempting to model length of stay (LOS) in a psychiatric child/adolescent setting. My LOS is censored for a few patients because they are required to leave the facility when they turn 18.
I was ...
2
votes
1
answer
123
views
Data Leakage Concerns
I've come across the concept of data leakage in which optimistically biased generalisation errors occur due to test data in some sense 'seeing' the training data. For instance, normalisation on an ...
1
vote
1
answer
908
views
Avoiding data leakage in preprocessing
I'm a data science newbie and a bit confused with the following:
I usually do the preprocessing on all predictors of a dataset, meaning
I create X by concatenating <...
2
votes
1
answer
1k
views
What is the difference between standardizing time series data and non-time series data?
From reading some answers on this site (1, 2, 3 and 4) I found that, on time series data, standardization must be applied separately on the train and test sets to avoid data leakage.
So the train data ...
0
votes
1
answer
53
views
Avoiding Data Leakage from Bucketed Features During Cross-Validation
I am working on a classification problem and have engineered a few categorical features with high cardinality by dummying out the most frequently occurring values and then using the response variable ...
0
votes
0
answers
864
views
Does k-fold cross-validation induce data leakage in time series data?
I created a predictive model using neural networks and applied it to a time series. This is how I split my data:
...
10
votes
1
answer
3k
views
Does using a random train-test split lead to data leakage?
I am trying to understand data leakage in modeling practice.
If we had a dataset of patient instances from 2000-2018 (with all patient visits included), and used a randomly selected train-test split (...
1
vote
0
answers
230
views
Doubly Robust Estimator
When using a doubly robust estimator, we train the m0/m1 models and a propensity score model to be used by the estimator.
Is it OK to use the same dataset to train those models and then use them to measure ATE ...
1
vote
1
answer
657
views
Machine Learning + Hyperparameter Tuning + Data Leakage: Is my procedure free of data leakage?
I'm trying to classify 8 types of hand gestures with EMG signals. For that I followed these steps:
Split the entire data into training data and test data
For training data I extracted features. Here ...
5
votes
1
answer
104
views
What does it mean that there is leakage of information when one uses a test set?
I have read about the term "leakage of information" that occurs when one tries to estimate the generalization error by using a test set in Machine Learning models. However, I was not able to ...
0
votes
1
answer
257
views
Data leakage with clustered observations
I have (what I call) a clustered dataset, that is: for one client, I can have multiple observations that will have some variables in common and some variables will be specific to each observation. ...