Make the data speak itself: Hypothesis Testing

1. What is ANOVA?

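ANOVA stands for Analysis of Variance. It compares the variance between group means with the variance within the groups in order to decide whether to reject the null hypothesis that all groups share the same population mean.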

2. A Case Study

Your company wants to deploy a new promotion plan that changes the price of some products. They want to test whether the new plan (Plan B) works better than the current one (Plan A). They randomly collect the average revenue over 15 months for both plans.

The dataset for this case is available on GitHub.

3. Data Analysis

- Show a statistical description of the data
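A minimal sketch of this step with pandas. The numbers are synthetic placeholders, not the post's dataset, and the column names `PlanA` and `PlanB` are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Placeholder data (assumption): 15 synthetic monthly revenues per plan,
# standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "PlanA": rng.normal(loc=100, scale=10, size=15),
    "PlanB": rng.normal(loc=115, scale=10, size=15),
})

# count, mean, std, min, quartiles and max for each plan
summary = df.describe()
print(summary)
```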

- Make descriptive plots of the data (a scatter plot and a box plot) to analyse the variability

Scatter Plot

Box Plot
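Plots like the two above could be produced roughly as follows. This is a sketch with matplotlib on synthetic placeholder data; the variable names, axis labels and month index are assumptions, not taken from the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
plan_a = rng.normal(100, 10, 15)
plan_b = rng.normal(115, 10, 15)
months = np.arange(1, 16)

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: one point per month and plan
ax_scatter.scatter(months, plan_a, label="PlanA")
ax_scatter.scatter(months, plan_b, label="PlanB")
ax_scatter.set_xlabel("Month")
ax_scatter.set_ylabel("Average revenue")
ax_scatter.legend()

# Box plot: median, quartiles and outliers for each plan
ax_box.boxplot([plan_a, plan_b])
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["PlanA", "PlanB"])
ax_box.set_ylabel("Average revenue")
```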

Observation: The plots suggest a clear difference between Plan A and Plan B.

- Calculate one-way ANOVA
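One way to run this step is `scipy.stats.f_oneway`. The sketch below uses synthetic placeholder data, so its numbers will not match the result reported here; the 5% significance level is the conventional choice, assumed for illustration.

```python
import numpy as np
from scipy import stats

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
plan_a = rng.normal(100, 10, 15)
plan_b = rng.normal(115, 10, 15)

# One-way ANOVA: null hypothesis is that both groups have the same population mean
f_stat, p_value = stats.f_oneway(plan_a, plan_b)

alpha = 0.05  # significance level (assumed)
reject_null = p_value < alpha
print(f"F = {f_stat:.3f}, p_value = {p_value:.4f}, reject null: {reject_null}")
```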

Observation: p_value = 0.2%, which is well below the 5% significance level, so we can reject the null hypothesis that group A and group B have the same population mean.

Why is there an F statistic in the ANOVA output? Can we be sure about the p-value we just got?

The one-way ANOVA p-value is computed from the F (Fisher) distribution, which is only valid when the assumptions of the test hold. To be sure about the p-value we got, we have to check those assumptions, starting with the normality of the samples. Some of the ANOVA assumptions are simple to validate, but others are tricky.
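To see where the F statistic comes from, it can be recomputed by hand as the ratio of the between-group mean square to the within-group mean square, with the p-value read from the F distribution. The sketch below does this on synthetic placeholder data and compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
plan_a = rng.normal(100, 10, 15)
plan_b = rng.normal(115, 10, 15)
groups = [plan_a, plan_b]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_manual, p_scipy)
```

A large F means the group means are spread out more than the noise inside each group would explain, which is why the p-value depends on the within-group data behaving as the test assumes.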

4. Assumptions

- The sample should be random and independent.
- The data in each treatment group should be normally distributed.
- The treatments should be homoscedastic. In other words, the population standard deviations of the groups are all equal.

4.1 Normality Test
There is a theorem saying that normality for every level is equivalent to normality of the residuals. Normality can be tested with the Shapiro-Wilk test and a Q-Q plot.

Shapiro-Wilk
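A sketch of the test using `scipy.stats.shapiro`, applied to each group separately. The data are synthetic placeholders, so the p-values will differ from the ones discussed in this post.

```python
import numpy as np
from scipy import stats

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
samples = {
    "PlanA": rng.normal(100, 10, 15),
    "PlanB": rng.normal(115, 10, 15),
}

# Shapiro-Wilk: null hypothesis is that the sample comes from a normal distribution
shapiro_results = {}
for name, sample in samples.items():
    w_stat, p_value = stats.shapiro(sample)
    shapiro_results[name] = (w_stat, p_value)
    print(f"{name}: W = {w_stat:.3f}, p_value = {p_value:.3f}")
```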

Observation:

The p_value of group A equals 45%, far above the 5% significance level. We cannot reject the null hypothesis that the data in group A are normally distributed.
In contrast, in group B, p_value = 5%, which is right at the 5% threshold, so we reject the null hypothesis that the data in group B are normally distributed.

Q-Q Plot
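The quantities behind a Q-Q plot can be computed with `scipy.stats.probplot`, which pairs the sorted sample values with the theoretical normal quantiles and fits a straight line through them. The sketch below uses a synthetic placeholder sample.

```python
import numpy as np
from scipy import stats

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
sample = rng.normal(100, 10, 15)

# osm: theoretical normal quantiles, osr: ordered sample values
# slope, intercept, r: least-squares line through the points and its correlation
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

Points that stay close to the fitted line (r near 1) are consistent with normality. Passing a matplotlib axes object through the `plot` argument draws the chart directly.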

Observation: The plots show that the data of group A are normally distributed while the data of group B are not.

The Shapiro-Wilk test and the Q-Q plot complement each other, and together they help us to be sure about the normality of the given data.

4.2 Homoscedasticity

Homoscedasticity means that the variances of the groups in a population are equal. As with normality, we can check it with a graphical method and with Levene's test.

Graphical Method: Residual Plot
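A residual plot for a one-way layout can be sketched as below: each residual is an observation minus its own group mean, plotted against that group mean. The data are synthetic placeholders and the plotting choices are assumptions, not the original notebook's.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
plan_a = rng.normal(100, 10, 15)
plan_b = rng.normal(115, 10, 15)

# Residual = observation minus the mean of its own group
resid_a = plan_a - plan_a.mean()
resid_b = plan_b - plan_b.mean()

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(np.full(len(resid_a), plan_a.mean()), resid_a, label="PlanA")
ax.scatter(np.full(len(resid_b), plan_b.mean()), resid_b, label="PlanB")
ax.axhline(0, linestyle="--")  # reference line at zero residual
ax.set_xlabel("Fitted value (group mean)")
ax.set_ylabel("Residual")
ax.legend()
```

If the vertical spread of the two columns of points looks similar, the equal-variance assumption is plausible.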

Levene's Test
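A sketch of the test with `scipy.stats.levene` on synthetic placeholder data. The `center="median"` setting is SciPy's default and is spelled out here only for clarity.

```python
import numpy as np
from scipy import stats

# Placeholder data (assumption): not the post's dataset.
rng = np.random.default_rng(0)
plan_a = rng.normal(100, 10, 15)
plan_b = rng.normal(115, 10, 15)

# Levene's test: null hypothesis is that all groups have equal variances
levene_stat, levene_p = stats.levene(plan_a, plan_b, center="median")
print(f"statistic = {levene_stat:.3f}, p_value = {levene_p:.3f}")
```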

Conclusion:
- The p_value from Levene's test tells us that the variances of the two groups are significantly different.
- The residual plot shows that the data in Group A and the data in Group B are not correlated.

5. Key Takeaways

- With ANOVA, the assumptions to check are normality and homoscedasticity.
- For normality, we use the Shapiro-Wilk test and a Q-Q plot of the residuals.
- For homoscedasticity, we plot the residuals and use Levene's test.

The test results and charts were generated in a Python notebook.

Thursday, December 19, 2019

Hypothesis Testing - ANOVA
