Questions tagged [data-leakage]

0 votes
0 answers
11 views

Use training or testing set when calculating sample weights for evaluation?

I have an ML model that has been built from data that is not representative of the population class frequencies. The majority class is actually undersampled, and so it is more frequent in the population ...
HaplessEcologist's user avatar
1 vote
1 answer
40 views

How does temporal data leakage happen?

Assume I use a moving window to slice a history of daily stock closing prices, using the past 7 days to predict the next day. For each training instance, I'm strictly using historical data to predict future ...
yang's user avatar
  • 149
0 votes
0 answers
29 views

Scenario, where PCA on X(train) and X(test) is not a data leakage

There are countless (re)posts discussing the question of using principal component analysis (PCA) as pre-processing method for the features of a regression/classification problem within a cross-...
Jonas S's user avatar
  • 31
0 votes
0 answers
30 views

Do features derived from an exploratory data analysis lead to inaccurate K-fold cross validation?

I'm a little confused about the interplay of exploratory data analysis, feature engineering and feature selection using k-fold cross validation. I would much appreciate it if someone could give me ...
ThePangolin's user avatar
3 votes
1 answer
69 views

Best Practices for Splitting Data in a Repeated Measures Classification Problem

I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application scenario, once the target becomes ...
jpsca1293's user avatar
0 votes
0 answers
36 views

Avoiding Information Leakage in Backtesting with CPCV-Tuned Hyperparameters

I'm using Combinatorial Purged Cross-Validation to tune hyperparameters for a binary classification model applied in a month-end trading strategy. I have 6 months of data and used CPCV with 15 splits ...
June's user avatar
  • 1
1 vote
0 answers
37 views

How should you split up data in a train-test-validation split

I've seen it is generally recommended, when using a train-test-validation data split, to first split your data into train and test datasets, and then further split the train dataset into a train and ...
sammcm998's user avatar
0 votes
0 answers
74 views

Does replacing binned variables with Weight of Evidence values introduce data leakage?

In my company I've been noticing some binary classification modeling code that replaces bins of a continuous variable with the corresponding Weight of Evidence (WoE) of the given bin. As far as I ...
jglad's user avatar
  • 33
0 votes
0 answers
24 views

Scaling data to a sample that is neither training nor validation. Is this data leakage?

TL;DR: Data being scaled to a sample that is neither training nor validation. Is this data leakage? Hi, I have a data set from samples that are distributed on a plate. Precisely, there are 96 wells ...
Luiz Gustavo's user avatar
0 votes
0 answers
70 views

Data leakage in time series forecasting framed as a supervised learning problem

Suppose that I have a simple univariate time series. My goal is to use the value of 3 consecutive days to predict the value of the fourth day. I built my dataset by applying a rolling window that ...
Ray's user avatar
  • 11
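
A minimal sketch of the setup this question describes, assuming the series is a plain Python list. The helper names `make_windows` and `chronological_split`, the toy values, and the 80/20 cut are illustrative assumptions, not taken from the post.

```python
def make_windows(series, n_lags=3):
    """Turn a univariate series into (features, target) pairs.

    Each row uses n_lags consecutive values to predict the next one.
    """
    rows = []
    for i in range(len(series) - n_lags):
        rows.append((series[i:i + n_lags], series[i + n_lags]))
    return rows


def chronological_split(rows, train_fraction=0.8):
    """Split without shuffling, so every training target precedes every test target."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


series = list(range(10))       # hypothetical daily values 0..9
rows = make_windows(series)    # rolling windows of 3 days -> 4th day
train, test = chronological_split(rows)
```

Shuffling `rows` before the split would instead let windows from the same stretch of the series land on both sides, which is the situation these rolling-window questions usually worry about.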
2 votes
1 answer
194 views

EDA and Model Selection for Forecasting while avoiding Data Leakage

How to do EDA and model selection for time series forecasting without data leakage? I'm assuming just checking for missing values is OK. But is graphing the entire time series considered data leakage? ...
pandashelp's user avatar
1 vote
0 answers
60 views

Should I delete samples from the training data that are present in the testing data by accident?

I classify pairs of entities, let's say dog-cat pairs, whether there is association between them (positive class) or there is not (negative class). I have a moderately sized positive dataset (~130k ...
oliver.c's user avatar
  • 185
2 votes
0 answers
37 views

Data leakage: Train test split before or after data preprocessing? [duplicate]

A while ago I came across the word "data leakage" for the first time, and after some research, I found that it is a common mistake among data science/machine learning practitioners. But the ...
jairiidriss's user avatar
0 votes
0 answers
41 views

Nested Cross Validation performance on each Sequential Feature Selection subset

I want to get cross-validated performance values of my model after hyperparameter tuning and sequential feature selection on each feature subset. Following this example, I want to use an outer-CV ...
Charlie's user avatar
  • 142
0 votes
0 answers
26 views

Temporal leakage or different phenomena?

I have the following problem/toy example: every week I sample data describing users (one row is one user). I want to predict whether, in the next three weeks, a user will be a fraud (1) or not (0), so basically binary ...
Quant Christo's user avatar
2 votes
1 answer
72 views

Data leakage or not?

The goal is to predict whether an employee will leave the company: yes or no. I have a dataframe with information about employees. There are 30 independent features and one dependent feature (Left: ...
Milvhb's user avatar
  • 21
0 votes
0 answers
41 views

Simple demonstration of imputation data leakage?

I'm aware that it's best practice to do all pre-processing within train-test splits, including data imputation. At least, it's recommended not to use the test data to generate the imputation model for ...
Evan's user avatar
  • 225
1 vote
0 answers
20 views

Data augmentation specific per class

I have a database of defects on plastic films. Defects are burns, holes, and similar things. Some defects are direction specific. A vertical sign on the material represents a scratch done on the ...
Jonny_92's user avatar
  • 151
2 votes
1 answer
498 views

Why doesn't CatBoost Encoding cause target leakage?

I'm currently working on a fraud detection problem with a dataset of 300,000 rows and 500 columns, 70 of which are categorical with over 10 categories each. I'm facing memory constraints and exploring ...
Connor's user avatar
  • 667
1 vote
1 answer
262 views

Does feature selection and model testing have to be coupled in each fold of the cross-validation?

Quick overview of my data and aims: I have two groups, 50 samples per group, and 6000 features. I want to find the minimal number of features capable of distinguishing both groups. I know the sample ...
Luiz Gustavo's user avatar
2 votes
3 answers
322 views

What is "information leak from test to train" ? Is stratification by target a leak?

It's common practice to do procedures such as standardization and even missing value imputation (commonly based on some means) after train/test split - otherwise it is treated as information leak from ...
Ars ML's user avatar
  • 31
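
A minimal sketch of the "standardize after the split" practice this excerpt mentions, using only the standard library. The function names `fit_scaler` and `apply_scaler` and the toy values are hypothetical, chosen for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate mean and standard deviation from the training values only."""
    return mean(values), pstdev(values)


def apply_scaler(values, mu, sigma):
    """Standardize values using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical training feature
test = [10.0, 20.0]                 # hypothetical test feature

mu, sigma = fit_scaler(train)       # test values never enter this call
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)
```

The test values are transformed with the training statistics, so nothing about the test distribution feeds into the fitted parameters.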
0 votes
1 answer
28 views

Is using the same person as data observation in different time stamps a way to produce data leakage in a Machine Learning model?

Let's assume we are going to train a regression model (could be any ML tabular solution for regression. Ex.: LGBM, XGBoost, Perceptron, ...) to predict a customer profit in the next month. While ...
Matheus Nascimento's user avatar
1 vote
1 answer
176 views

Preprocessing on training set only or both training & test set? Seems like there would be errors for both answers

Let's say I have a dataset that hasn't been split into train/test yet. Upon loading it, I discover that there are columns where there are nulls that need to be filled in, some quadratic relationships ...
Katsu's user avatar
  • 1,021
5 votes
1 answer
5k views

Do we One Hot Encode (create Dummy Variables) before or after Train/Test Split?

I've seen quite a lot of conflicting views on if one-hot encoding (dummy variable creation) should be done before/after the training/test split. Responses seem to state that one-hot encoding before ...
Beans On Toast's user avatar
5 votes
0 answers
181 views

Examples of Leakages in the Training Data

I was wondering about Data Leakage in the data preparation phase during the training of a model. By definition, data leakage happens when information is revealed to the model giving it an unrealistic ...
Denis Mazzucato's user avatar
3 votes
2 answers
488 views

Is data leakage from time series autocorrelation actual data leakage?

That's the question: Is data leakage from time series autocorrelation actual data leakage? To explain it with an example (I will separate the example in numbers to give more structure to the ...
Chris's user avatar
  • 535
1 vote
1 answer
26 views

How to choose train, test and validation data for two-staged experiments?

I have a question concerning the possibility of the train-test-validation split in a staged experiment setup. The data I used is split up into 3 parts: train, test and validation data. Then I try to ...
user19452872's user avatar
1 vote
1 answer
262 views

Is data leakage a concern when using an ensemble of leave-one-out predictions?

I am new to stacking. I have a dataset with N samples and 7 tables corresponding to different data types, plus a binary label. Some tables have dozens of features, others have many thousands. I train ...
SebDL's user avatar
  • 211
1 vote
1 answer
714 views

Is shuffling timeseries data, then separating into training/testing sets a form of data leakage?

I am building a multiple regression MLP (Multi-Layer Perceptron), the input is 8 weather variables collected from October-February, and the output is another weather variable. The assumption is that ...
schmixi's user avatar
  • 43
4 votes
1 answer
612 views

Lagged variables, data leakage and machine learning

I am reading a paper that fits a random forest (RF) to some data that is grouped by company and quarter. In the data engineering stage, the authors include 'lagged' variables of many of the ...
thebabystatistician's user avatar
3 votes
1 answer
664 views

Train/test split on time-based data with lagged features

I am working with data on bank transactions, and am using RFM (recency/frequency/monetary value) features like days since last transaction, number of transactions last n days, average value of ...
mdouglas81's user avatar
3 votes
0 answers
131 views

Avoiding data leakage in preprocessing and handling unseen values in test data

I've been reading up on avoiding data leakage in the preprocessing step of a machine-learning/data-science pipeline, specifically that it is wrong to apply preprocessing to both training and test data ...
njp's user avatar
  • 131
0 votes
1 answer
64 views

how to deal with data leakage in historical data

I have a dataset containing matches from 2000 to 2018, and I am asked to predict match outcomes for the year 2017. To avoid data leakage, I am going to train my model only on 2000 to 2016. In the ...
Mohamed Amine's user avatar
1 vote
0 answers
193 views

Cross validation within a bootstrap sample: is leakage a problem here?

I would like to calculate the sampling distribution for logistic LASSO coefficients. One approach to calculating this sampling distribution is described on page 143 of "Statistical Learning with ...
D I's user avatar
  • 11
2 votes
0 answers
80 views

Quote on too good to be true model performance

I seem to recall that there is a nice quote by some (well known?) machine learning expert about too good to be true model performance. The quote is something like "If your model performance looks ...
Björn's user avatar
  • 35.2k
1 vote
0 answers
585 views

Normalization and RidgeCV in Sklearn Pipeline - possible data leakage?

To avoid data leakage between the train and test set, I'm using sklearn's Pipeline as follows: ...
flanders's user avatar
1 vote
0 answers
18 views

Preprocessing for the final model to be deployed [duplicate]

Typically for a ML workflow, we import the data (X and y), split the X and ...
spectre's user avatar
  • 350
1 vote
0 answers
247 views

Many Preprocessing steps can cause data leakage, then how should we perform EDA?

For the past week, I have been constantly checking with people on this sub on how to avoid data leakage during preprocessing like feature selection and/or scaling etc here and here. I understand most ...
nan's user avatar
  • 845
1 vote
1 answer
336 views

Data Leak or Feature Engineering in regression problem?

I recently worked on a housing price dataset, where the goal is to predict sale prices. I had the idea to construct a feature on the training set, which would be dependent on the target variable and ...
Nils Lcrx's user avatar
0 votes
1 answer
126 views

Modeling length of stay with (Cox) regression with censored observations

I'm attempting to model length of stay (LOS) in a psychiatric child/adolescent setting. My LOS is censored for a few patients because they are required to leave the facility when they turn 18. I was ...
Chiel's user avatar
  • 1
2 votes
1 answer
123 views

Data Leakage Concerns

I've come across the concept of data leakage in which optimistically biased generalisation errors occur due to test data in some sense 'seeing' the training data. For instance, normalisation on an ...
N Blake's user avatar
  • 579
1 vote
1 answer
908 views

Avoiding data leakage in preprocessing

I'm a data science newbie and a bit confused with the following: I usually do the preprocessing on all predictors of a dataset, meaning I create X by concatenating <...
LeLuc's user avatar
  • 671
2 votes
1 answer
1k views

What is the difference between standardizing time series data and non-time series data?

From reading some answers on this site (1, 2, 3 and 4) I found that, on time series data, standardization must be applied separately on the train and test sets to avoid data leakage. So the train data ...
Marcus's user avatar
  • 265
0 votes
1 answer
53 views

Avoiding Data Leakage from Bucketed Features During Cross-Validation

I am working on a classification problem and have engineered a few categorical features with high cardinality by dummying out the most frequently occurring values and then using the response variable ...
Jake Niederer's user avatar
0 votes
0 answers
864 views

Does k-fold cross-validation induce data leakage in time series data?

I created a predictive model using neural networks and applied it to a time series. This is how I split my data: ...
Marcus's user avatar
  • 265
10 votes
1 answer
3k views

Does using a random train-test split lead to data leakage?

I am trying to understand data leakage in modeling practice. If we had a dataset of patient instances from 2000-2018 (with all patient visits included), and used a randomly selected train-test split (...
AmeySMahajan's user avatar
1 vote
0 answers
230 views

Doubly Robust Estimator

When using a Doubly Robust Estimator, we train m0/m1 models and a propensity score model to be used by the estimator. Is it OK to use the same dataset to train those models and then use them to measure ATE ...
Dennis Lyubyvy's user avatar
1 vote
1 answer
657 views

Machine Learning + Hyperparameter Tuning + Data Leakage : Is my procedure free of data leakage?

I'm trying to classify 8 types of hand gestures with EMG signals. For that I followed these steps: Split the entire data into training data and test data. For the training data, I extracted features. Here ...
Debbie's user avatar
  • 129
5 votes
1 answer
104 views

what does it mean that there is leakage of information when one uses a test set?

I have read about the term "leakage of information" that occurs when one tries to estimate the generalization error by using a test set in Machine Learning models. However, I was not able to ...
Layla's user avatar
  • 631
0 votes
1 answer
257 views

Data leakage with clustered observations

I have (what I call) a clustered dataset, that is: for one client, I can have multiple observations that will have some variables in common and some variables will be specific to each observation. ...
amestrian's user avatar
  • 265