Hi! I'm Jake, a Program Manager at DataRobot. Today, I'll be talking to you about
target leakage. Now target leakage is a problem because if you use a model that
contains leakage, it may look good during the training phase but it will make
wildly inaccurate predictions once you finally get to using in a production
environment. So what is target leakage? It's when you're using information in
the dataset that you would not have at the time of prediction. Effectively,
you're using data from the future in order to try and predict the past. For
any Mean Girls fans out there, it's like that scene with Karen where she is able
to predict the weather if she already knows that it's raining. She's using
information from the future to then try and predict the past. So let's look at
this in a banking environment. We have a bank underwriter whose responsibility is
to decide which loans the bank will or will not decide to issue. Each one is
going to expose the bank to a certain degree of risk.
Now if the bank underwriter is relying on a model that contains target leakage,
a new loan application might show that it has a 20 percent chance of default according
to the model, but the reality is that it's actually far greater at say 80
percent chance of default. This exposes the bank to significantly more risk than
the underwriter was expecting and perhaps then the bank's rules actually
allow for. But how did all this happen? It starts with the curation of the original
dataset. We want to make sure that we're only using information we'd have at the
time of prediction and not anything that occurs later than that.
So with this specific example, the dataset accidentally contained the latest
snapshot of all of our loan customers' FICO scores. This is problematic because
we're trying to predict the chance of default at the time of application not say
90 days after the loan has already been issued. So we need to be very very
careful in these situations to ensure that we don't include information from
the future when we're trying to predict information that's occurring in the past.
In this specific example, if we wanted to say predict the chance of a loan
default after 90 days, that most recent snapshot of FICO scores is perfectly
acceptable to use because it's information that's occurring at the
same exact time as when the predictions are being made. So now that we know a bit
more about target leakage we want to focus on how do we eliminated. Perhaps
the only possible way to do so is to focus on the dataset that you're going
to use for training when you're creating it. We want to think about each
individual variable and really understand the context of if that
information is available at the same time that we're making prediction or if
that data only becomes available to us or is known after the time of prediction.
In the case of the banker, again we want to focus on making predictions at
the time the loan application is received and therefore we can only use
information from that same point in time. We would need FICO scores of all of our
bank customers at the time that they made their application, not nine days
after they've already had the loan. If we look at it in the case of a
telecommunications provider, we want to look at each one of these variables and
understand if it's available to us at the point in time where we're trying to
predict if a specific wireless subscriber will churn. So, customer ID, the
length of time in months that their account has been open, their state of
residence, different features about their cell phone plan, and if they're a
business or individual subscriber, are all variables and information we would
know when we're trying to predict if an individual subscriber is going to churn
in the future. Now the one that should cause some red flags and raise a bit of
suspicion is this "Rep" column here. This is the name of the last service rep that
any one of these customers spoke to. Now it may seem innocent enough on the
surface. However, for this specific wireless communications company, they
route all of their retention calls through one department.
Therefore there are a finite number of names that could appear as the last
representative an individual customer spoke to. In this case, Rose and Taylor
both work in the retention department and therefore they're commonly
associated and they're appearing here next to customers who ended up churning.
This is problematic because the model will start to associate the name Rose
and the name Taylor with customers that have a high risk of churning and it will
start to assign lower risk to names like Frank and Rene that didn't result in
customers that churned. The problem with this though is that we would never know
if a customer is going to speak to Rose or Taylor before we know
their intent churn because we would not route them to the retention
department otherwise. So what we want to do is either eliminate this column entirely or
we would want to use this dataset for a different type of problem where we do
know the last service rep they spoke to at that time of prediction. A great
example would be if we want to predict if a previous wireless subscriber who
did churn is willing or interested in coming back for our service. So that's
target leakage for you today. You should be very careful and mindful to make sure
it doesn't find its way into your training data sets. Otherwise, it can
cost your company millions in the long run and impact customer satisfaction
negatively.
