Data leakage or not?

Question

The goal is to predict whether an employee will leave the company: yes or no. I have a dataframe with information about employees. There are 30 independent features and one dependent feature (Left: Yes or No). This data is gathered over a time frame of one year.

I also have data about which employees worked on the same projects. Using it, I added an extra independent feature to the original dataframe: "leavers", which counts how many of an employee's project collaborators eventually left the company. The idea is that an employee who worked with many eventual leavers may have a higher propensity to leave, because that employee is influenced by those colleagues.

Would it be data leakage if I include this feature in my original dataframe and do a train/test split?

It depends on how exactly the feature is defined and how the model will be used. Suppose the model is used for prediction, and the goal is to use only information observed at time $t$ to predict whether a given employee will leave the company between $t$ and $t+h$, where $h$ in your case is a horizon of one year. Then it wouldn't make sense to define a feature that counts the close colleagues/collaborators who left the company after time $t$: you couldn't use the model to make forward-looking predictions, because a required input would not yet be observed. – Adrian, Commented Jun 15, 2023 at 18:12
However, if the feature counts close collaborators who left the company before time $t$, it would be observable at prediction time and might be okay. You would still have to think about your train/test split, because your data might not be i.i.d. – Adrian, Commented Jun 15, 2023 at 18:14
I was also thinking along those lines, because at prediction time we have this information. My head seems to keep spinning in a loop. – Milvhb, Commented Jun 15, 2023 at 20:02
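
To make the distinction in the comments above concrete, here is a minimal sketch of a "leavers" count restricted to departures observed before a prediction time $t$. The employee names, the `collaborations` pairs, the `leave_dates` mapping, and the helper `leavers_before` are all hypothetical and made up for illustration; they are not from the question's data.

```python
from datetime import date

# Hypothetical toy data: (employee, collaborator) pairs from shared projects.
collaborations = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("bob", "ann"), ("bob", "cat"),
]
# Hypothetical leave dates; employees not listed have not left.
leave_dates = {"bob": date(2022, 3, 1), "cat": date(2022, 11, 15)}


def leavers_before(employee, cutoff, collaborations, leave_dates):
    """Count distinct collaborators of `employee` who left strictly before `cutoff`."""
    collaborators = {c for e, c in collaborations if e == employee}
    return sum(
        1
        for c in collaborators
        if c in leave_dates and leave_dates[c] < cutoff
    )


cutoff = date(2022, 6, 30)  # assumed prediction time t

# Point-in-time version: only departures already observed at t.
leavers_at_t = leavers_before("ann", cutoff, collaborations, leave_dates)

# "Eventual leavers" version: also counts departures after t,
# which would not be known when the prediction is made.
leavers_eventual = leavers_before("ann", date.max, collaborations, leave_dates)

print(leavers_at_t, leavers_eventual)
```

In this toy data the two counts differ for "ann", and the difference is exactly the information that would be unavailable at prediction time.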

Joe Mansley · Accepted Answer · 2023-06-15 18:05:56Z

Yes, this would be data leakage.

An employee leaving (or not) may affect whether the other employees he's worked with will leave.

Isn't that the point of including that feature? We want to know how many of the employee's collaborators left, because there is often an "attrition avalanche". – Milvhb, Commented Jun 15, 2023 at 20:00
@Milvhb Sorry, my answer wasn't clear. I'm assuming your new feature 'leavers' includes employees who left after the one year, so the effect could go in both directions. – Joe Mansley, Commented Jun 16, 2023 at 13:48
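
One way to keep the effect from going in both directions is to build each row from a snapshot: the feature uses only departures before a cutoff, and the label uses only departures in the horizon after it. The sketch below assumes hypothetical employees, dates, and a helper `build_rows`, with a later cutoff for the test set than for the training set. It illustrates the idea only and is not a recommendation for any particular split.

```python
from datetime import date, timedelta

# Hypothetical toy data; employees missing from leave_dates have not left.
employees = ["ann", "bob", "cat", "dan"]
leave_dates = {
    "bob": date(2021, 9, 1),
    "cat": date(2022, 4, 10),
    "dan": date(2023, 2, 5),
}
collaborators = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "dan"},
    "dan": {"cat"},
}


def build_rows(cutoff, horizon_days):
    """Return (employee, leavers_before_cutoff, left_within_horizon) rows
    for everyone still employed at `cutoff`."""
    end = cutoff + timedelta(days=horizon_days)
    rows = []
    for e in employees:
        left = leave_dates.get(e)
        if left is not None and left < cutoff:
            continue  # already gone at prediction time, nothing to predict
        # Feature: only collaborators whose departure precedes the cutoff.
        feature = sum(
            1
            for c in collaborators[e]
            if c in leave_dates and leave_dates[c] < cutoff
        )
        # Label: did this employee leave in [cutoff, cutoff + horizon)?
        label = left is not None and cutoff <= left < end
        rows.append((e, feature, label))
    return rows


# Train on an earlier snapshot, test on a later one.
train_rows = build_rows(date(2022, 1, 1), 365)
test_rows = build_rows(date(2023, 1, 1), 365)
print(train_rows)
print(test_rows)
```

Note that the same employee can appear in both snapshots, so the rows are still not independent. That is the i.i.d. caveat raised in the comments on the question.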
