Can we generate a contingency table for a chi-square test using Python?

Question 1

python statistics scipy statsmodels chi-squared

icm · Jul 15, 2014 · Viewed 8.2k times · Source

Answer

You can use pandas.crosstab to generate a contingency table from a DataFrame. From the documentation:

Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
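To make the two modes in that description concrete, here is a minimal sketch on hypothetical toy data (the column names `smoker`, `sex` and `tip` are made up for illustration). Only the default frequency table is suitable as input for a chi-square test; the aggregated variant is shown for contrast.

```python
import pandas as pd

# Hypothetical toy data: two categorical factors and one numeric column.
df = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no", "yes"],
    "sex": ["m", "f", "f", "f", "m"],
    "tip": [1.0, 2.0, 3.0, 6.0, 5.0],
})

# Default behaviour: count how often each (smoker, sex) pair occurs.
counts = pd.crosstab(df["smoker"], df["sex"])

# With `values` and `aggfunc`: aggregate a numeric column per cell instead.
# Cells with no observations (here: smoker == "no", sex == "m") become NaN.
mean_tip = pd.crosstab(df["smoker"], df["sex"], values=df["tip"], aggfunc="mean")

print(counts)
print(mean_tip)
```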

Below is a usage example:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Some fake data.
n = 5  # Number of samples.
d = 3  # Dimensionality.
c = 2  # Number of categories.
data = np.random.randint(c, size=(n, d))
data = pd.DataFrame(data, columns=['CAT1', 'CAT2', 'CAT3'])

# Contingency table.
contingency = pd.crosstab(data['CAT1'], data['CAT2'])

# Chi-square test of independence.
# The statistic is named chi2 so it does not overwrite c (the number of categories).
chi2, p, dof, expected = chi2_contingency(contingency)

The following data table

[data table image not preserved]

generates the following contingency table

[contingency table image not preserved]

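Then, scipy.stats.chi2_contingency(contingency) returns (0.052, 0.819, 1, array([[1.6, 0.4], [2.4, 0.6]])) (values rounded), that is, the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies. For a 2x2 table such as this one, Yates' continuity correction is applied by default.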
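As a sanity check, the reported numbers can be reproduced from a small table. The observed counts below are an assumption: they are inferred from the reported expected frequencies and statistic, not copied from the original answer. The comparison with `correction=False` is only there to show how much the continuity correction changes the statistic on such a tiny sample.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed observed counts (rows: CAT1 = 0, 1; columns: CAT2 = 0, 1).
# Inferred from the reported output, not taken from the original answer.
observed = np.array([[1, 1],
                     [3, 0]])

# Default call: Yates' continuity correction is used because dof == 1.
chi2, p, dof, expected = chi2_contingency(observed)

# Same table without the continuity correction, for comparison.
chi2_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

print(round(chi2, 3), round(p, 3), dof)
print(expected)
print(round(chi2_raw, 3), round(p_raw, 3))
```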

Question 2

I am using the scipy.stats.chi2_contingency method to get chi-square statistics. It requires a frequency table, i.e. a contingency table, as its parameter. But I have a feature vector and want to generate the frequency table automatically. Is there a function available for this? I am currently doing it like this:

import numpy as np
from scipy.stats import chi2_contingency

def contigency_matrix_categorical(data_series, target_series, target_val, indicator_val):
    # Count the rows for every (target, indicator) combination.
    observed_freq = {}
    for target in target_val:
        observed_freq[target] = {}
        for indicator in indicator_val:
            mask = (target_series == target) & (data_series == indicator['val'])
            observed_freq[target][indicator['val']] = data_series[mask].count()
    # Build the observed-frequency matrix: one row per target, one column per indicator.
    # Note: 5 is added to every cell count here.
    f_obs = [[count + 5 for count in row.values()] for row in observed_freq.values()]
    c, p, dof, expected = chi2_contingency(np.array(f_obs))
    return {'score': c, 'pval': p, 'dof': dof}

Here data_series and target_series are the column values, and the other two arguments are the target values and the indicator values. Can anyone help? Thanks.
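For comparison, the manual counting above could be sketched with pandas.crosstab along the following lines. The helper name `chi2_from_series` and the example data are hypothetical. Unlike the function above, this sketch does not add 5 to each cell, and it only includes categories that actually occur in the data.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_from_series(feature, target):
    """Cross-tabulate two categorical Series and run the chi-square test."""
    # Rows are target values, columns are feature values.
    table = pd.crosstab(target, feature)
    chi2, p, dof, _expected = chi2_contingency(table)
    return {"score": chi2, "pval": p, "dof": dof}


# Hypothetical example data.
feature = pd.Series(["a", "b", "a", "c", "b", "a", "c", "b"], name="feature")
target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1], name="target")

result = chi2_from_series(feature, target)
print(result)
```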
