I was wondering about data leakage in the data preparation phase during the training of a model. By definition, data leakage happens when information is revealed to the model that gives it an unrealistic advantage to make better predictions (cf. page 93, Feature Engineering for Machine Learning, 2018). According to this book, there are two kinds of data leakage:
- Leakage in features, which occurs when something extremely informative about the true label is included as a feature. A whole survey (cf. "Leakage in Data Mining: Formulation, Detection, and Avoidance") already covers this point extensively.
- Leakage into the training data, which occurs when the test set is mixed in with the training data (e.g. via min-max normalization of the train and test sets together; see the sketch after this list). Regarding this point, there are plenty of blog posts that explain this kind of leakage without giving examples that actually show its effect. In the blog post cited above, training with and without data leakage reaches the same overall accuracy.
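To make the min-max example concrete, here is a minimal sketch (toy data, not taken from any of the cited posts) of how fitting the scaler on train and test together lets the test set influence the training features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])  # an extreme value that only appears in the test set

# With leakage: min/max are computed over train and test together.
leaky = MinMaxScaler().fit(np.vstack([X_train, X_test]))
# Without leakage: min/max come from the training data only.
clean = MinMaxScaler().fit(X_train)

print(leaky.transform(X_train).ravel())  # [0.    0.111 0.222] - squeezed by the test point
print(clean.transform(X_train).ravel())  # [0.  0.5 1. ]
```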
My question is: do you have any example of a model with high accuracy on the test set that performs poorly on real data due to leakage in the training data?
=============
I've already seen these two questions:
They clearly state that this problem is real, but I have not been able to reproduce it myself so far. Additionally, I tried this little experiment, where training with and without data leakage barely changed the predictions on unseen data:
Internal Testing Accuracy With Leaks: 96.042
Internal Testing Accuracy Without Leaks: 96.107
Generalization Accuracy With Leaks: 95.482
Generalization Accuracy Without Leaks: 95.697
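For reference, the kind of comparison I mean looks roughly like the sketch below; the dataset, model, and split sizes here are placeholders, not the ones from my actual experiment:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Three splits: train, internal test (used as "test set" during development),
# and a held-out generalization set standing in for real, unseen data.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_rest, X_gen, y_rest, y_gen = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_int, y_train, y_int = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)

for leak in (True, False):
    # With leakage, the scaler is fit on train + internal test together;
    # without leakage, it is fit on the training data only.
    fit_data = np.vstack([X_train, X_int]) if leak else X_train
    scaler = MinMaxScaler().fit(fit_data)
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
    print(f"Leak={leak}: "
          f"internal={clf.score(scaler.transform(X_int), y_int):.3f}, "
          f"generalization={clf.score(scaler.transform(X_gen), y_gen):.3f}")
```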