
I was wondering about data leakage in the data preparation phase during the training of a model. By definition, data leakage happens when information is revealed to the model that gives it an unrealistic advantage in making better predictions (cf. p. 93 of Feature Engineering for Machine Learning, 2018). Following this reference, there are two kinds of data leakage:

  1. Leakage in features, which occurs when something extremely informative about the true label is included as a feature. A whole survey (cf. "Leakage in data mining: Formulation, detection, and avoidance") already covers this point extensively.
  2. Leakage in the training data, which occurs when information from the test set is mixed into the training data (e.g., via min-max normalization of the train and test sets together; see the sketch after this list). Plenty of blog posts explain this kind of leakage, but without examples that effectively show its effect: in the cited blog post, training with and without data leakage obtains the same accuracy overall.

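To make the second kind concrete, here is a minimal sketch of the leaky versus the leak-free way to normalize, assuming scikit-learn's `MinMaxScaler` and synthetic placeholder data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))         # placeholder training features
X_test = rng.normal(loc=2.0, size=(20, 3))  # placeholder test features, shifted

# Leaky: the scaler is fit on train+test, so the test set's min/max
# leak into how the training data is scaled.
leaky_scaler = MinMaxScaler().fit(np.vstack([X_train, X_test]))
X_train_leaky = leaky_scaler.transform(X_train)

# Leak-free: fit on the training data only, then apply to both sets.
clean_scaler = MinMaxScaler().fit(X_train)
X_train_clean = clean_scaler.transform(X_train)
X_test_clean = clean_scaler.transform(X_test)

# The fitted parameters differ; that difference is exactly the leaked information.
print(leaky_scaler.data_max_ - clean_scaler.data_max_)
```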
My question is: Do you have any example of a model with high accuracy on the test set that performs poorly on real data due to the leakage in the training data?

=============

I've already seen these two questions:

  1. Avoiding data leakage in preprocessing
  2. Does using a random train-test split lead to data leakage?

They clearly state that this problem is real, but I have not been able to reproduce it myself so far. Additionally, I tried this little experiment (a skeleton of it is sketched after the numbers below), where training with and without data leakage did not change the predictions on unseen data:

```
Internal Testing Accuracy With Leaks:    96.042
Internal Testing Accuracy Without Leaks: 96.107
Generalization Accuracy With Leaks:      95.482
Generalization Accuracy Without Leaks:   95.697
```
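For reference, the experiment followed this general skeleton (sketched here with a synthetic `make_classification` dataset standing in for the actual data and model I used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Hold out a "generalization" set that plays the role of real, unseen data.
X_dev, X_gen, y_dev, y_gen = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0)

def run(leak):
    scaler = MinMaxScaler()
    if leak:
        scaler.fit(np.vstack([X_train, X_test]))  # test set leaks into the scaling
    else:
        scaler.fit(X_train)                       # training data only
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scaler.transform(X_train), y_train)
    internal = clf.score(scaler.transform(X_test), y_test)  # internal test accuracy
    general = clf.score(scaler.transform(X_gen), y_gen)     # "real data" accuracy
    return internal, general

for leak in (True, False):
    internal, general = run(leak)
    print(f"leak={leak}: internal={internal:.3f}, generalization={general:.3f}")
```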
  • A few thoughts that haven't congealed into a convincing variation of your experiment: with an unpenalized linear model (or a tree model), it makes no difference (aside from computational issues) what the shift/scale is; the result is the same effective model. With a penalized linear model, the difference is in how the penalty applies; for that to be meaningful, the test set must differ appreciably in center/scale from the training set. – Commented Sep 9, 2022 at 19:13
  • I tried modifying your example by sorting X by its first column, dropping the redundant columns, setting a much smaller data size, not shuffling when splitting, and otherwise tweaking the dataset, plus more severe regularization in the model, to no avail. (I also reran the experiment with many random seeds to tease out whether any difference was significant.) – Commented Sep 9, 2022 at 19:14
  • All of this assumes that you're really interested in scaling rather than other preprocessing. Certainly, as your first linked question asserts, any preprocessing that takes the target values into account will make a difference in performance easier to find. Also, do you only care about predictive performance, or is inference also of interest? (I stuck to accuracy scores just to align with what you've already done, but accuracy isn't the best metric.) – Commented Sep 9, 2022 at 19:15
  • There's an example in The Elements of Statistical Learning, Chapter 7.10.2, "The Wrong and Right Way to Do Cross-validation." It uses variable selection rather than your example of scaling/shifting the data. The authors show that if you (1) screen on the combined train+test data to choose a subset of variables, and then (2) cross-validate using just those chosen variables, you get very overoptimistic results; if instead you screen only on the training data within each cross-validation fold, your results are fine. (A sketch of this setup follows these comments.)
    – civilstat, Commented Sep 22, 2022 at 14:53
  • Is this the kind of thing you had in mind for your setting #2? All their features are spurious, so I wouldn't count it as setting #1: none of the features are actually informative about the label, and they only seem informative because you've inappropriately used test data in choosing features. But perhaps I've misunderstood your distinction between settings 1 and 2.
    – civilstat, Commented Sep 22, 2022 at 14:55
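
Following up on civilstat's pointer, here is a minimal sketch of the ESL Chapter 7.10.2 setup, assuming scikit-learn's `SelectKBest` and `cross_val_score`; all features are pure noise, so honest accuracy should hover around 50%:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))    # 50 samples, 5000 pure-noise features
y = rng.integers(0, 2, size=50)    # labels independent of X

# Wrong: screen features on ALL the data, then cross-validate on the survivors.
X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong = cross_val_score(LogisticRegression(max_iter=1000), X_screened, y, cv=5).mean()

# Right: screening happens inside each CV fold, on that fold's training data only.
pipe = make_pipeline(SelectKBest(f_classif, k=100), LogisticRegression(max_iter=1000))
right = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky screening CV accuracy:     {wrong:.3f}")  # typically far above 0.5
print(f"fold-wise screening CV accuracy: {right:.3f}")  # typically near 0.5
```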

