Pearson's Chi Square Test Python

cooldood3490 · Apr 25, 2015 · Viewed 7.5k times

I have two arrays on which I would like to run Pearson's chi-square test (goodness of fit). I want to test whether there is a significant difference between the expected and observed results.

observed = [11294, 11830, 10820, 12875]
expected = [10749, 10940, 10271, 11937]

I want to compare 11294 with 10749, 11830 with 10940, 10820 with 10271, etc.

Here's what I have:

>>> from scipy.stats import chisquare
>>> chisquare(f_obs=[11294, 11830, 10820, 12875],f_exp=[10749, 10940, 10271, 11937])
(203.08897607453906, 9.0718379533890424e-44)

where 203 is the chi-square test statistic and 9.07e-44 is the p-value. I'm confused by the results. Since the p-value of 9.07e-44 is below 0.05, we reject the null hypothesis and conclude that there is a significant difference between the observed and expected results. That doesn't seem right to me, because the numbers look so close. How do I fix this?

Answer

Pranzell · Feb 26, 2018

In general, for a test of independence, the null hypothesis (H0) says that the two variables (X and Y) are independent, i.e. changing the values of X wouldn't affect the values of Y.

For example, X = [1,2,3,4] and Y = [2,4,6,8]

If you calculate the p-value for this case with a test of association (a correlation test, for example), it should come out very small, meaning that data like these would be very unlikely if the null hypothesis were true, i.e. if X and Y were independent of each other.

So we reject the null hypothesis here: the two variables are dependent on each other, in the form Y = 2X.
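The answer doesn't name a specific method, so as one illustration only, here is a minimal sketch that assumes Pearson's correlation coefficient as the measure of association. The helper `pearson_r` is a hypothetical name written for this example, not a library function.

```python
import math

X = [1, 2, 3, 4]
Y = [2, 4, 6, 8]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(X, Y)
print(r)
```

A coefficient at or very near 1 reflects the exact linear relation Y = 2X, which is the opposite of what independence would look like.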

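In your case, the p-value of 9.0718379533890424e-44 is read the same way: a value this small means that data like yours would be very unlikely if the null hypothesis were true. For a goodness-of-fit test, the null hypothesis is that the observed counts follow the expected ones, so rejecting it means the observed counts differ significantly from the expected counts. With counts this large, gaps of roughly 5 to 8 percent are enough to produce that result.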
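To see where the statistic comes from, it can be recomputed by hand. The sketch below uses only the standard library and the arrays from the question. The function `chi2_sf_3dof` is a hypothetical helper written for this example: it uses the closed-form upper-tail probability that holds for 3 degrees of freedom (4 categories minus 1), which is an assumption matching the default of `chisquare`.

```python
import math

observed = [11294, 11830, 10820, 12875]
expected = [10749, 10940, 10271, 11937]

# Pearson's statistic: sum over categories of (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(contributions)

# Relative gap between each observed and expected count
relative_gaps = [(o - e) / e for o, e in zip(observed, expected)]

def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Illustrative helper: this closed form is valid for 3 degrees of freedom only.
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_3dof(statistic)

for o, e, gap, c in zip(observed, expected, relative_gaps, contributions):
    print(f"{o:>6} vs {e:>6}: gap {gap:6.2%}, contributes {c:8.3f}")
print(f"statistic = {statistic:.3f}, p-value = {p_value:.3g}")
```

Each squared difference is divided by a count of about 11,000, yet the differences themselves are in the hundreds, so every category adds tens of units to the statistic, and the total comes to about 203.09 as in the question. Note also that the two arrays have different totals (46819 observed vs 43897 expected). Recent SciPy releases check that the totals agree and raise a `ValueError` otherwise, so the call from the question may need rescaled expected counts there.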

P.S. Your interpretation of the result is correct.