Many real-life transactional/CRM applications are not careful about creating snapshots of their data, so there is little historical data for a model to learn from.
Imagine a typical example: predicting the ACV (annual contract value) growth of a new account. A potential variable of interest is the account's employee count; we would like to test whether a company's employee growth is predictive of its ACV growth.
The tricky part is that such variables are often available only in real time and have never been snapshotted (sometimes the records simply get overwritten and the historical values are lost).
Although this variable is available at prediction time, we cannot use it when constructing a training dataset from historical data. In such circumstances, we might be tempted to overlook the look-ahead bias and use the current value in training as well.
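To make the asymmetry concrete, here is a toy sketch (the record layout and field names are hypothetical):

```python
# Hypothetical CRM record: 'employee_count' is live-only; the value as of
# each historical deal date has been overwritten.
record = {"account_id": 1, "employee_count": 520}

# Scoring a new account today is fine -- the live value is what we want:
def features_for_scoring(record):
    return {"employee_count": record["employee_count"]}

# Rebuilding a training row for a deal closed in 2022 is not: the only
# value we can read is today's, which postdates the label. Reading
# record["employee_count"] here would leak future information.
def features_for_training(record, as_of="2022-01-01"):
    return {"employee_count": None}  # the as-of value no longer exists

print(features_for_scoring(record))   # {'employee_count': 520}
print(features_for_training(record))  # {'employee_count': None}
```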
In the past, I have omitted such variables and managed to build models without them. This can cause some confusion for the clients/stakeholders involved, as they assume that such a simple variable must be available for the model to use.
My questions are:
What approaches could one experiment with to still use this variable in training when it has not been snapshotted (apart from omitting it)? Proxy variables could be one method, but they too would need to be snapshotted, and in most cases either all such variables are snapshotted or none are. I was not able to find much research on this; maybe I was not searching for the right topics.
What are the potential side effects of including this variable during training (i.e., ignoring the bias)? Specifically, is there any way to isolate and estimate the error due to this bias alone? Personally I would never want to include it, but if it is more of a trade-off, then I would want to estimate the expected error.
Are there any unsupervised methods that do not require labeled historical training data and could be useful here (clustering, perhaps)? Is there any research on this?
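To make the second question concrete, here is a toy simulation (all numbers invented) of the kind of inflation I am worried about: if headcount grows partly *because* ACV grew, then the current value of the feature encodes the label, and its apparent correlation with the label is much higher than the true historical signal's.

```python
import random
import statistics

random.seed(0)

def corr(xs, ys):
    # Pearson correlation, population version
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

n = 5000
# Employee count at prediction time (standardized); true effect on ACV
# growth is modest:
hist_emp = [random.gauss(0, 1) for _ in range(n)]
acv_growth = [0.3 * h + random.gauss(0, 1) for h in hist_emp]

# After the outcome, headcount moves partly with ACV growth, so today's
# (overwritten) value leaks the label:
curr_emp = [h + 0.8 * y + random.gauss(0, 0.3)
            for h, y in zip(hist_emp, acv_growth)]

print(f"corr(historical feature, label): {corr(hist_emp, acv_growth):.2f}")
print(f"corr(current feature,   label): {corr(curr_emp, acv_growth):.2f}")
```

The gap between the two correlations is one crude way to see how much "signal" the current value owes to the outcome itself, but in practice the feedback strength is unknown, which is exactly why I am asking whether the error can be estimated.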