Understanding Data Leakage in Machine Learning and how it can be detected

‘One of the key things you will find here is data leakage problems and that is a serious problem you need to deal with.’

– Jeremy Howard, Founder & deep learning researcher & fast.ai / USF; CSO @ platform.ai & doc.ai

‘Data leakage is one of the most serious and widespread problems in data mining and machine learning and something that as a machine learning practitioner, you must always be on guard against.’

– Kevyn Collins-Thompson, Associate Professor at University of Michigan

This article addresses the following questions:

What is Data Leakage in Machine Learning?

Why is it important to detect and avoid data leakage?

How to detect data leakage through an example.

What is Data Leakage in Machine Learning?

In Machine Learning, data leakage occurs when the model is given information during training that will not be available when the model is used to make predictions in real life. Most often, it occurs when a feature that directly or indirectly depends on the target variable is used to train the model. But there are other ways in which data leakage can occur. Sometimes we introduce it ourselves by creating an imprudent feature that is indirectly affected by the target variable.

I will give an example later in this article, but you can read about a variety of interesting scenarios in which people faced data leakage in this subreddit.

Why is it important to detect and avoid data leakage?

Data leakage can cause serious problems for a machine learning model because it generally produces exceptional accuracy, or much better results than the model would achieve in a real-world scenario. After deployment to production, where the leaked information is not available, the model fails or performs rather poorly. Not detecting data leakage, or detecting it at a late stage, can cost a lot of time and money for any organization working on Predictive Analytics use cases.

How to detect data leakage through an example.

Detecting data leakage can be a tedious and cumbersome task, as there is no well-defined procedure or set of steps that guarantees it is avoided. One needs to create and select features very diligently while building the model. Even so, there is always a possibility that data leakage creeps into our machine learning model.

There are a few steps that can be followed to give us an indicator or warning that a particular feature might be the cause of data leakage in our model.

In the next part, I describe a simple technique I devised for identifying such a feature.

I have picked the IBM HR Analytics Employee Attrition & Performance dataset for this example of detecting data leakage.

You can access the complete notebook containing code with description for each step on github.

Initially, I created a predictive analytics model without any data leakage to predict which employees are going to churn from the organization.

Here is the snapshot of classification report results I got:

Now let’s look at the feature importance graph:

Now, to demonstrate how data leakage can be detected, I had to create a feature that causes it. So, I created a feature named ‘Notice_Period_Served’. Let me explain how I created this new feature.

There were two cases in the problem:

1. Where the value of the target variable, ‘Attrition’ = ‘No’:

In this case, the employees have not churned, so they would not have served a notice period either. For all these cases, I set the value of the feature ‘Notice_Period_Served’ to ‘No’.

2. Where the value of the target variable, ‘Attrition’ = ‘Yes’:

Employees who have left the company might or might not have served their notice period. So for 50% of these employees I set the value of this column to ‘Yes’, and for the other 50% I set it to ‘No’.
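The two cases above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name `add_notice_period_served`, the `leak_fraction` and `seed` parameters, the random choice of which leavers get ‘Yes’, and the toy labels are all hypothetical and are not taken from the notebook or the IBM dataset.

```python
import random

def add_notice_period_served(attrition, leak_fraction=0.5, seed=0):
    """Build a 'Notice_Period_Served' column from the target labels.

    attrition: list of 'Yes'/'No' values of the target variable.
    leak_fraction: share of leavers marked 'Yes' (50% in the article).
    Choosing those leavers at random is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # Case 2: only employees with Attrition == 'Yes' can receive 'Yes'.
    leaver_idx = [i for i, label in enumerate(attrition) if label == "Yes"]
    n_served = round(len(leaver_idx) * leak_fraction)
    served = set(rng.sample(leaver_idx, n_served))
    # Case 1: everyone else, including all non-leavers, stays 'No'.
    return ["Yes" if i in served else "No" for i in range(len(attrition))]

# Toy labels for illustration only (not the IBM data).
attrition = ["No"] * 80 + ["Yes"] * 20
notice = add_notice_period_served(attrition)
```

Because the column is computed from ‘Attrition’ itself, it could never be known at prediction time, which is exactly what makes it a leaking feature.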

This way, we have created partial data leakage in our dataset. Now, let’s see how our results change after adding the feature ‘Notice_Period_Served’ to our model.

Here is the snapshot of the results I got with this new feature introducing data leakage:

Results snapshot with new feature introducing data leakage

So, as we can see, the overall accuracy increased from 0.90 to 0.95. Precision and recall for both classes also rose sharply.

Now let’s see how the feature importance of the model has changed:
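To get a feel for why even a partial leak lifts the metrics, here is a toy calculation. It is not the model from the notebook: the labels are made up, and the two "models" are deliberately trivial rules (always predict the majority class, versus predict ‘Yes’ whenever the leaked flag is ‘Yes’).

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data for illustration only: 80 stayers, 20 leavers,
# and half of the leavers carry the leaked 'Yes' flag.
attrition = ["No"] * 80 + ["Yes"] * 20
notice = ["No"] * 90 + ["Yes"] * 10

# Rule without the leak: always predict the majority class.
baseline_pred = ["No"] * len(attrition)
# Rule with the leak: predict 'Yes' exactly when the flag is 'Yes'.
leaky_pred = list(notice)

baseline_acc = accuracy(attrition, baseline_pred)  # 0.8
leaky_acc = accuracy(attrition, leaky_pred)        # 0.9
```

The leaked flag alone identifies half of the leavers with no mistakes, so any model that sees it gets those cases for free. None of that gain would survive in production, where the flag does not exist yet.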

Model with modified feature importance

So we can see that our feature importance graph looks a lot different now. The newly added feature ‘Notice_Period_Served’, which causes data leakage and has inflated the accuracy of the model, is also one of the important features. This is what we should usually expect: the feature causing data leakage should be among the most influential features of the model. In fact, in the case of high data leakage, it may well be the most important feature. So, all of the most important features should be analyzed closely to check for data leakage.

Now going one step further, let us take a look at the correlation of all the important features with the target variable.

Overview of the correlation of all the important features with the target variable

As we can see, the feature ‘Notice_Period_Served’, which causes data leakage, has an exceptionally high correlation with the target variable ‘Attrition’. Although the feature was only sixth among the most important features, it has a far higher correlation with the target variable than any of the other important features. This gives us an indication that this feature might be causing data leakage, as it is highly correlated with the target variable.
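As a rough sketch of this check, the correlation between a Yes/No feature and a Yes/No target can be computed by encoding both as 0/1 and taking the Pearson correlation. The article does not say which correlation measure was used, so Pearson is an assumption here, and the helper names and toy data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def to_binary(labels):
    """Encode 'Yes' as 1 and anything else as 0."""
    return [1 if value == "Yes" else 0 for value in labels]

# Toy data for illustration only: 80 stayers, 20 leavers,
# half of the leavers flagged as having served notice.
attrition = ["No"] * 80 + ["Yes"] * 20
notice = ["No"] * 90 + ["Yes"] * 10

corr = pearson(to_binary(notice), to_binary(attrition))  # about 0.67
```

Even though the flag covers only half of the leavers, the correlation is already about 0.67 in this toy setup, which is the kind of outlier worth investigating when the other features correlate far more weakly with the target.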

So now, we know that this feature might be causing data leakage in the model. So let’s look at the distribution of this feature more closely, using a count plot:

Feature distribution display using a count plot

As we can see, the distribution of the feature ‘Notice_Period_Served’ is highly skewed. All the cases where the value of the target variable ‘Attrition’ is ‘No’ lie only where the value of ‘Notice_Period_Served’ is ‘No’. There is no observation where the value of ‘Attrition’ is ‘No’ and the value of ‘Notice_Period_Served’ is ‘Yes’.

This graph again gives us a strong indication that this feature might be causing data leakage and we should check if we need to keep this feature in our model.
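The same check can be done without a plot by counting the (feature, target) combinations and looking for feature values that only ever appear with a single target class. The sketch below is a hypothetical helper on toy data, not code from the notebook.

```python
from collections import Counter

def crosstab(feature, target):
    """Count how often each (feature value, target value) pair occurs."""
    return Counter(zip(feature, target))

def one_sided_values(feature, target):
    """Map each feature value seen with only one target class to that class."""
    classes_seen = {}
    for f, t in zip(feature, target):
        classes_seen.setdefault(f, set()).add(t)
    return {f: next(iter(ts)) for f, ts in classes_seen.items() if len(ts) == 1}

# Toy data for illustration only.
attrition = ["No"] * 80 + ["Yes"] * 20
notice = ["No"] * 90 + ["Yes"] * 10

table = crosstab(notice, attrition)
suspicious = one_sided_values(notice, attrition)  # {'Yes': 'Yes'}
```

Here `table[("Yes", "No")]` is 0: nobody who stayed has ‘Notice_Period_Served’ = ‘Yes’. A feature value that perfectly determines the target is not always leakage, but it is a strong reason to ask when and how that value is recorded.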

Conclusion

So, we discussed what data leakage is and what steps can be taken to detect or avoid it. There can also be much more complex situations, such as more than one feature causing data leakage, or data leakage caused by a continuous variable. But following the above steps should give you a strong idea of which features may be causing data leakage, and thus help you prevent or minimize it.


WRITTEN BY Kapil Khanna, VP at AISmartz | @AiSmartz

Originally posted on https://medium.com/@AiSmartz/
