Recently, I have seen many folks who want to start with data science and study concepts and tests, but often cannot intuitively tell where to use a particular test or validation method when working with data.
So, in this series, I’ll start with a real-world problem at hand, and we’ll derive the intuition behind the concepts and define their usage. We won’t even know the fancy names until the very end! Hence, the mystery.
Let’s see how many masters of mystic stats we can breed here.
With that context set up, let’s get started.
A client (say John) that we were engaging with had built a feature (Add to favourite) on their commercial website. The revenue model was based on registrations. He had to justify to senior stakeholders that the feature was indeed valuable and generating more registrations ($$$).
So, the business ask was:
Is my feature (Add to favourite) generating more revenue?
Well, how do we measure this?
Follow along to see.
The data collection part consisted of identifying a unique user, whether the user engaged with the feature, and finally whether the user registered.
A simple way of answering the above question is to split users into two groups. Expose the new feature to one half. The other half is oblivious to the new feature’s existence in the same timeframe. Now, just get the number of registrations for each.
For 10K users in the same time frame, it would have looked like below:
5K users not exposed to the feature – 20% registrations
5K users exposed to the feature – 30% registrations
Guess what, this intuitive idea is called A/B Testing. This is ideal to have. But often, features are not built to allow this flow. Welcome to the real world!
As is so common in industry, we only had some users who had accessed the website earlier, without the feature, and some users who had accessed it more recently, with the feature. Well, now John was in a soup, as he had data in the following format.
For 10K users in different time frames, it looks like below:
5K users not exposed to the feature (say in 2017) – 20% registrations
5K users exposed to the feature (say in 2018) – 30% registrations
Now, this time one could conclude that registrations went up by 10 percentage points (from 20% to 30%). Hence the feature is adding value.
But WAIT, is that really the case???
One could object that the number of registrations is not constant and keeps changing. This increase could just be random.
Well, well… you don’t like being questioned (you should not, but that’s for another time), but you can’t deny that it is a valid question. So, you need a method/insight that lets you measure the chances of this change happening randomly.
Let’s rearrange data first in a format that allows some insights. Below are the numbers we got.
               | Engaged | Not Engaged | Row Total
Registered     | 2000    | 1000        | 3000
Not Registered | 500     | 6500        | 7000
Column Total   | 2500    | 7500        | 10000
Take a minute and understand the table above.
What we did above is just summarise the total information in a table. Let’s call this TABLE 1.
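To make the summarising step concrete, here is a minimal sketch of how such a table could be tallied from per-user rows. The record layout `(user_id, engaged, registered)` and the helper names `make_records` and `contingency_table` are assumptions made for illustration; the records are synthetic and simply reproduce the counts in TABLE 1.

```python
from collections import Counter

def make_records():
    """Build synthetic (user_id, engaged, registered) rows matching TABLE 1.

    The layout is hypothetical; the real tracking data may look different.
    """
    groups = [
        (True, True, 2000),    # engaged and registered
        (False, True, 1000),   # not engaged but registered
        (True, False, 500),    # engaged but not registered
        (False, False, 6500),  # neither engaged nor registered
    ]
    records = []
    for engaged, registered, count in groups:
        for _ in range(count):
            records.append((len(records), engaged, registered))
    return records

def contingency_table(records):
    """Count users for every (registered, engaged) combination."""
    counts = Counter((registered, engaged) for _, engaged, registered in records)
    # Rows: Registered, Not Registered. Columns: Engaged, Not Engaged.
    return [[counts[(reg, eng)] for eng in (True, False)] for reg in (True, False)]

table_1 = contingency_table(make_records())
row_totals = [sum(row) for row in table_1]
col_totals = [sum(col) for col in zip(*table_1)]

print(table_1)     # [[2000, 1000], [500, 6500]]
print(row_totals)  # [3000, 7000]
print(col_totals)  # [2500, 7500]
```

The row and column totals are all we will need from here on to reason about what "no impact" would look like.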
Now, we can see two sets of outcomes:
Engagement – Yes/No
Registration – Yes/No
Now the question I have is ‘Is ENGAGEMENT having an effect on REGISTRATION?’ or, to rephrase, ‘Is there a relationship between these two variables?’ or, to rephrase again, ‘Does Registration depend on Engagement?’ or, to rephrase finally, ‘Are REGISTRATION and ENGAGEMENT INDEPENDENT?’
We want to know what the user numbers would be if this feature had no impact.
Consider only the totals. From the column totals, we can derive the insight that there is a 25% chance that a user would engage (2500/10000). Let’s fill in the cells based on that chance. So, for the first cell, if 3000 users registered and 25% of them engaged, then the expected number of registered and engaged users is 3000 x 25% = 750. Similarly, if 7000 users did not register and 25% of them engaged, then the expected number of non-registered engaged users is 7000 x 25% = 1750.
               | Engaged            | Not Engaged
Registered     | 3000 x 0.25 = 750  | 3000 x 0.75 = 2250
Not Registered | 7000 x 0.25 = 1750 | 7000 x 0.75 = 5250
Let’s call this TABLE 2.
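The same numbers can be reproduced in a few lines. This is only a sketch under the "no impact" reasoning above: each cell gets its row total multiplied by its column's share of all users. The function name `expected_frequencies` is made up for this example.

```python
def expected_frequencies(observed):
    """Expected counts if the row and column variables were unrelated.

    Each cell is row_total * column_total / grand_total, which is the same
    as applying the overall engagement share (25% / 75%) to each row total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# TABLE 1: rows are Registered / Not Registered,
# columns are Engaged / Not Engaged.
observed = [[2000, 1000], [500, 6500]]

table_2 = expected_frequencies(observed)
print(table_2)  # [[750.0, 2250.0], [1750.0, 5250.0]]
```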
The above numbers give us the values we would expect if there was no relationship between the two variables under consideration. The condition we framed above, ‘if there was no relationship between the two variables under consideration’, is really not an interesting one. It adds no further conclusion if it turns out to be true. Let’s call this CONDITION 1.
Okay, we got these numbers assuming there was no relation between the two variables.
Now, what we need is the probability of seeing values as far from the expected ones as those we observed, if the gaps between observed and expected values were nothing but random, normally distributed noise.
We identify the residual (observed minus expected) for each cell, square it, normalise it by the expected value (so that a gap of a given size counts for more in a cell where fewer users were expected), and sum across all cells.
A typical cell would look like:
(Observed - Expected)^2 / Expected = (2000 - 750)^2 / 750 ≈ 2083.33
The above value is the residue for the current cell, i.e. Registered and Engaged. We follow similar calculations for the others. Now we add these across all cells.
This sum, about 3968.25 for our table, is the ‘Total Residue’. We assume that the error is normally distributed (ASSUMPTION 1), and we have to identify the chances of getting this error under the corresponding PDF.
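For anyone who wants to check the arithmetic, here is a small sketch that computes the residue for every cell and adds them up. It takes TABLE 1 and TABLE 2 as given; the function names `cell_residue` and `total_residue` are just labels chosen for this example.

```python
# TABLE 1 (observed) and TABLE 2 (expected under "no relationship").
# Rows: Registered / Not Registered. Columns: Engaged / Not Engaged.
observed = [[2000, 1000], [500, 6500]]
expected = [[750.0, 2250.0], [1750.0, 5250.0]]

def cell_residue(obs, exp):
    """Squared gap between observed and expected, scaled by expected."""
    return (obs - exp) ** 2 / exp

def total_residue(observed, expected):
    """Sum the per-cell residues over the whole table."""
    return sum(
        cell_residue(o, e)
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

first_cell = cell_residue(observed[0][0], expected[0][0])
total = total_residue(observed, expected)

print(round(first_cell, 2))  # 2083.33
print(round(total, 2))       # 3968.25
```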
Now, we just need to locate our error value under this PDF and we have the answer to ‘What are the chances of getting the observed values totally at random if the variables are independent?’.
Note that, if we sample from a standard normal random variable and square the values (which we did before, to deal with negative values), the resulting distribution is an interesting one and has been studied for a long time. Let’s call this DISTRIBUTION OF INTEREST. We leverage that study here and look up the table to get the value.
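If you would rather see this distribution than take it on faith, a quick simulation helps. The sketch below draws standard normal samples, squares them, and checks two things: the average of the squares is close to 1, and roughly 5% of them land above 3.841 (which is 1.96 squared, the familiar 95% cut-off). The seed and the sample size are arbitrary choices for this illustration.

```python
import random

random.seed(0)        # arbitrary seed, only so reruns give the same draws
n_samples = 200_000   # arbitrary sample size for the illustration

# Square each standard normal draw; squaring removes the sign.
squares = [random.gauss(0.0, 1.0) ** 2 for _ in range(n_samples)]

mean_square = sum(squares) / n_samples
share_above_cutoff = sum(s > 3.841 for s in squares) / n_samples
largest_square = max(squares)

print(mean_square)         # close to 1
print(share_above_cutoff)  # close to 0.05
print(largest_square)      # far below a total residue in the thousands
```

Even the largest squared draw out of 200,000 is tiny next to the total residue from our table, which already hints at where this is going.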
The moment of truth:
Let’s take a pause here for some good news. Let me just tell you, we just covered the following:
- Contingency Tables.
- Expected Frequencies.
- Null Hypothesis(Hypothesis testing).
- Chi-square statistic.
- Assumptions of chi-square application(application of central limit theorem).
- Chi-square distribution.
Don’t believe it? Let’s just see how…
- TABLE 1 is a Contingency table.
- TABLE 2 is Expected frequencies.
- CONDITION 1 is the NULL HYPOTHESIS. (In summary, the uninteresting condition of your variables would be the ‘null hypothesis’.)
- TOTAL RESIDUE is the CHI-SQUARE statistic.
- ASSUMPTION 1 is one of the assumptions when applying the chi-square test, and also an application of the Central limit theorem.
- DISTRIBUTION OF INTEREST is the CHI-SQUARE DISTRIBUTION.
- The p-value: to be honest, we did not cover it yet. It is below.
The value we get from the chi-square table is ‘<0.00001’. This is the p-value.
That’s it. What we can conclude is, ‘If the variables are totally independent, the chances of getting a result at least this extreme are less than 0.001%.’
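Instead of a printed table, the same look-up can be sketched in code. One detail the story skipped: a table with 2 rows and 2 columns has (2 - 1) x (2 - 1) = 1 degree of freedom, so the sketch below assumes the chi-square distribution with 1 degree of freedom. For that case the tail probability equals `erfc(sqrt(x / 2))`, because the statistic behaves like a single squared standard normal. The function name `chi_square_tail_1dof` is made up for this example, and 3968.25 is the total residue computed from TABLE 1 and TABLE 2.

```python
import math

def chi_square_tail_1dof(x):
    """P(statistic >= x) for a chi-square variable with 1 degree of freedom.

    With 1 degree of freedom the statistic is one squared standard normal,
    so the tail is P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

total_residue = 3968.25  # sum of (observed - expected)^2 / expected

p_value = chi_square_tail_1dof(total_residue)
print(p_value < 0.00001)  # True

# Sanity check against the well-known 5% cut-off of 3.841.
print(round(chi_square_tail_1dof(3.841), 3))  # 0.05
```

The p-value is so small here that it underflows to zero in floating point, which is consistent with the ‘<0.00001’ read off the table.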
Hence, we can reject the null hypothesis that engagement has no impact on registrations. John can have a peaceful night! So can you :).
Moral of the story: It’s all really intuitive. We don’t really need to recite formulas. Just need to have a moment of reflection and done!
That’s it, folks, for today’s episode of mystic stats!
Happy Machine Learning!!!