2
$\begingroup$

The goal is to predict whether an employee will leave the company: yes or no. I have a dataframe with information about employees. There are 30 independent features and one dependent feature (Left: Yes or No). This data is gathered over a time frame of one year.

I also have data about which employees worked on the same projects. Using this data, I added an extra independent feature, "leavers", to the original dataframe: the number of colleagues an employee has worked with who eventually left the company. An employee who worked with many colleagues who eventually left may have a higher propensity to leave, because that employee is influenced by them.
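
For concreteness, here is a minimal sketch of how such a feature could be computed. The `employees` and `assignments` tables, their column names, and the toy values are illustrative assumptions, not the actual data.

```python
import pandas as pd

# Hypothetical data: one row per employee, with the target column "Left".
employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "Left": ["Yes", "No", "Yes", "No"],
})

# Hypothetical data: one row per (employee, project) assignment.
assignments = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "project_id": ["A", "A", "B", "B", "B"],
})

# Pair up every two employees who share a project, dropping self-pairs.
pairs = assignments.merge(assignments, on="project_id", suffixes=("", "_peer"))
pairs = pairs[pairs["employee_id"] != pairs["employee_id_peer"]]
# Count each colleague once, even if several projects are shared.
pairs = pairs[["employee_id", "employee_id_peer"]].drop_duplicates()

# Mark which colleagues left, then count them per employee.
left_ids = set(employees.loc[employees["Left"] == "Yes", "employee_id"])
pairs["peer_left"] = pairs["employee_id_peer"].isin(left_ids)
leavers = pairs.groupby("employee_id")["peer_left"].sum()

# Employees without any colleagues get a count of zero.
employees["leavers"] = (
    employees["employee_id"].map(leavers).fillna(0).astype(int)
)
```

Computed this way, each row's `leavers` value is derived from the `Left` labels of other rows.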

Would it be data leakage if I include this feature in my original dataframe and then do a train/test split?

$\endgroup$
3
  • 1
    $\begingroup$ It depends on how exactly the feature is defined and how the model will be used. Suppose the model is used for prediction, and the goal is to use only information observed at time $t$ to predict whether a given employee will leave the company between $t$ and $t+h$, where $h$ in your case is a horizon of one year. Then it wouldn't make sense to define a feature that counts the close colleagues/collaborators who left the company after time $t$: you couldn't use the model to make forward-looking predictions, because a required input would not yet be observed. $\endgroup$
    – Adrian
    Commented Jun 15, 2023 at 18:12
  • 3
    $\begingroup$ However, if the feature (the model input) counts close collaborators who left the company before time $t$, it would be observable to you at prediction time and might be okay. You would still have to think about your train/test split, because your data might not be i.i.d. $\endgroup$
    – Adrian
    Commented Jun 15, 2023 at 18:14
  • $\begingroup$ I was also thinking along those lines, because at prediction time we have this information. My head seems to keep spinning in a loop. $\endgroup$
    – Milvhb
    Commented Jun 15, 2023 at 20:02

1 Answer

0
$\begingroup$

Yes, this would be data leakage.

Whether an employee leaves (or not) may affect whether the colleagues they have worked with leave, so the feature is partly determined by the outcome you are trying to predict.
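
If the feature is instead restricted to colleagues who left before the prediction time $t$, as suggested in the comments on the question, it is observable at prediction time. Below is a minimal sketch of that point-in-time variant. The `leave_date` column, the `collaborations` table, the `build_snapshot` helper, and the cutoff date are all hypothetical names and values chosen for illustration.

```python
import pandas as pd

# Hypothetical data: NaT in leave_date means the employee has not left.
employees = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "leave_date": pd.to_datetime(["2022-10-15", None, "2023-03-01", None]),
})

# Hypothetical data: symmetric list of who has worked with whom.
collaborations = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 3, 4, 1, 4],
    "peer_id":     [2, 1, 3, 2, 4, 3, 4, 1],
})


def build_snapshot(employees, collaborations, cutoff,
                   horizon=pd.DateOffset(years=1)):
    """Build features and target as seen from prediction time `cutoff`."""
    # Colleagues already gone at the cutoff: known at prediction time.
    gone = set(employees.loc[employees["leave_date"] < cutoff, "employee_id"])

    # Only employees still at the company can be predicted on.
    at_risk = employees[~employees["employee_id"].isin(gone)].copy()

    # Feature: number of collaborators who left strictly before the cutoff.
    peer_gone = collaborations["peer_id"].isin(gone)
    counts = collaborations[peer_gone].groupby("employee_id").size()
    at_risk["leavers_before_t"] = (
        at_risk["employee_id"].map(counts).fillna(0).astype(int)
    )

    # Target: left within [cutoff, cutoff + horizon). NaT compares as False.
    end = cutoff + horizon
    at_risk["left_within_horizon"] = (
        (at_risk["leave_date"] >= cutoff) & (at_risk["leave_date"] < end)
    )
    return at_risk


snapshot = build_snapshot(employees, collaborations, pd.Timestamp("2023-01-01"))
```

With this construction the feature uses only departures before the cutoff and the target uses only departures after it. A train/test split could then, for example, train on snapshots from earlier cutoffs and test on a later one.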

$\endgroup$
2
  • $\begingroup$ Isn't that the point of including that feature? We want to know how many of the colleagues an employee worked with have left, because there is often an "attrition avalanche". $\endgroup$
    – Milvhb
    Commented Jun 15, 2023 at 20:00
  • $\begingroup$ @Milvhb Sorry, my answer wasn't clear. I'm assuming your new feature 'leavers' includes employees who left after the one year, so the effect could go in both directions. $\endgroup$ Commented Jun 16, 2023 at 13:48
