
I was wondering about data leakage in the data preparation phase during the training of a model. By definition, data leakage happens when information is revealed to the model that gives it an unrealistic advantage in making better predictions (cf. p. 93 of Feature Engineering for Machine Learning, 2018). Following this reference, there are two kinds of data leakage:

  1. Leakage in features, which occurs when something extremely informative about the true label is included as a feature. A whole survey (cf. "Leakage in data mining: Formulation, detection, and avoidance") already covers this point extensively.
  2. Leakage in the training data, which occurs when information from the test set is mixed into the training data (e.g., via min-max normalization of the train and test sets together; see the sketch after this list). Plenty of blog posts explain this kind of leakage, but without examples that effectively show its effect: in the cited blog post, training with and without data leakage obtains the same accuracy overall.

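To make the second kind concrete, here is a minimal sketch of the leaky versus the leak-free way to normalize, assuming scikit-learn's `MinMaxScaler` and synthetic placeholder data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))         # placeholder training features
X_test = rng.normal(loc=2.0, size=(20, 3))  # placeholder test features, shifted

# Leaky: the scaler is fit on train+test, so the test set's min/max
# leak into how the training data is scaled.
leaky_scaler = MinMaxScaler().fit(np.vstack([X_train, X_test]))
X_train_leaky = leaky_scaler.transform(X_train)

# Leak-free: fit on the training data only, then apply to both sets.
clean_scaler = MinMaxScaler().fit(X_train)
X_train_clean = clean_scaler.transform(X_train)
X_test_clean = clean_scaler.transform(X_test)

# The fitted parameters differ; that difference is exactly the leaked information.
print(leaky_scaler.data_max_ - clean_scaler.data_max_)
```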
My question is: Do you have any example of a model with high accuracy on the test set that performs poorly on real data due to the leakage in the training data?

=============

I've already seen these two questions:

  1. Avoiding data leakage in preprocessing
  2. Does using a random train-test split lead to data leakage?

They clearly state that this problem is real, but I have not been able to reproduce it myself so far. Additionally, I tried this little experiment (a skeleton of it is sketched after the numbers below), where training with and without data leakage did not change the predictions on unseen data:

```
Internal Testing Accuracy With Leaks:    96.042
Internal Testing Accuracy Without Leaks: 96.107
Generalization Accuracy With Leaks:      95.482
Generalization Accuracy Without Leaks:   95.697
```
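For reference, the experiment followed this general skeleton (sketched here with a synthetic `make_classification` dataset standing in for the actual data and model I used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Hold out a "generalization" set that plays the role of real, unseen data.
X_dev, X_gen, y_dev, y_gen = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0)

def run(leak):
    scaler = MinMaxScaler()
    if leak:
        scaler.fit(np.vstack([X_train, X_test]))  # test set leaks into the scaling
    else:
        scaler.fit(X_train)                       # training data only
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scaler.transform(X_train), y_train)
    internal = clf.score(scaler.transform(X_test), y_test)  # internal test accuracy
    general = clf.score(scaler.transform(X_gen), y_gen)     # "real data" accuracy
    return internal, general

for leak in (True, False):
    internal, general = run(leak)
    print(f"leak={leak}: internal={internal:.3f}, generalization={general:.3f}")
```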
  • A few thoughts that haven't congealed into a convincing variation of your experiment: with an unpenalized linear model (or a tree model), it makes no difference (aside from computational issues) what the shift/scale is; the result is the same effective model. With a penalized linear model, the difference is in how the penalty applies; for that to be meaningful, the test set must differ appreciably in center/scale from the training set. – Commented Sep 9, 2022 at 19:13
  • I tried modifying your example by sorting X by its first column, dropping the redundant columns, setting a much smaller data size, not shuffling when splitting, and otherwise tweaking the dataset, plus more severe regularization in the model, to no avail. (I also reran the experiment with many random seeds to tease out whether any difference was significant.) – Commented Sep 9, 2022 at 19:14
  • All of this assumes that you're really interested in scaling rather than other preprocessing. Certainly, as your first linked question asserts, any preprocessing that takes the target values into account will make a difference in performance easier to find. Also, do you only care about predictive performance, or is inference also of interest? (I stuck to accuracy scores just to align with what you've already done, but accuracy isn't the best metric.) – Commented Sep 9, 2022 at 19:15
  • There's an example in The Elements of Statistical Learning, Chapter 7.10.2, "The Wrong and Right Way to Do Cross-validation." It uses variable selection rather than your example of scaling/shifting the data. The authors show that if you (1) screen on the combined train+test data to choose a subset of variables, and then (2) cross-validate using just those chosen variables, you get very overoptimistic results; if instead you screen only on the training data within each cross-validation fold, your results are fine. (A sketch of this setup follows these comments.)
    – civilstat, Commented Sep 22, 2022 at 14:53
  • Is this the kind of thing you had in mind for your setting #2? All their features are spurious, so I wouldn't count it as setting #1: none of the features are actually informative about the label, and they only seem informative because you've inappropriately used test data in choosing features. But perhaps I've misunderstood your distinction between settings 1 and 2.
    – civilstat, Commented Sep 22, 2022 at 14:55
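
Following up on civilstat's pointer, here is a minimal sketch of the ESL Chapter 7.10.2 setup, assuming scikit-learn's `SelectKBest` and `cross_val_score`; all features are pure noise, so honest accuracy should hover around 50%:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))    # 50 samples, 5000 pure-noise features
y = rng.integers(0, 2, size=50)    # labels independent of X

# Wrong: screen features on ALL the data, then cross-validate on the survivors.
X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong = cross_val_score(LogisticRegression(max_iter=1000), X_screened, y, cv=5).mean()

# Right: screening happens inside each CV fold, on that fold's training data only.
pipe = make_pipeline(SelectKBest(f_classif, k=100), LogisticRegression(max_iter=1000))
right = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky screening CV accuracy:     {wrong:.3f}")  # typically far above 0.5
print(f"fold-wise screening CV accuracy: {right:.3f}")  # typically near 0.5
```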

