Input format for Kruskal-Wallis test in Python

Annevv · May 21, 2015

I am comparing the number of structural breaks in DNA regions between cancer patients and healthy people. I am trying to run a Kruskal-Wallis test (SciPy Stats) on the number of breaks for each region, to see if there is a difference between the two distributions. I am not sure whether the input for the Kruskal-Wallis test should be arrays (documentation) or a list of arrays (elsewhere on the internet).

First, I tried an array for sample+control like this:

controls = ['1', '2', '3', '4', '5']
samples = ['10', '20', '30', '40', '50']
n=0
for item in controls:
    array_item = np.array([item, samples[n]])
    kw_test = stats.mstats.kruskalwallis(array_item)
    print(kw_test)
    n+=1

That gave me the following output for all items:

(0.0, nan)

I also tried converting the individual data points into arrays and then running the KW test.

controls = ['1', '2', '3', '4', '5']
samples = ['10', '20', '30', '40', '50']
n=0
kw_results = []
for item in controls:
    array_controls = np.array([item])
    array_samples = np.array([samples[n]])
    kw_test = stats.mstats.kruskalwallis(array_samples, array_controls)
    kw_results.append(kw_test)
    n+=1
print(kw_results)

That gave (1.0, 0.31731050786291404) for all comparisons, even when I changed one of the lists drastically.

Digging deeper, I read that the input should be a list of arrays. I thought that passing only two data points (one sample, one control) might have caused the '(0.0, nan)', so I tried a list of arrays as well.

controls = ['1', '2', '3', '4', '5']
samples = ['10', '20', '30', '40', '50']
list_ = []
n=0
for item in controls:
    array_item = np.array([item, samples[n]])
    list_.append(array_item)
    n+=1
kw_test = stats.mstats.kruskalwallis(list_)
print(kw_test)

That gave me this error:

TypeError: Not implemented for this type

Now I am not sure what format/type to use. Hopefully someone can help me out!

Answer

Osian · Jul 25, 2015

The scipy.stats.mstats.kruskalwallis function takes each group as a separate array argument, not as a single list of arrays. The groups do not need to contain the same number of observations.
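As a minimal sketch of that calling convention, the snippet below passes two groups of different sizes as separate arguments. The values and the names group_a and group_b are made up for illustration and are not from the question.

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical break counts; the two groups deliberately differ in size.
group_a = np.array([1, 2, 3, 4, 5])
group_b = np.array([10, 20, 30, 40, 50, 60, 70])

# Each group is its own positional argument, not one combined list.
h_stat, p_value = mstats.kruskalwallis(group_a, group_b)

print("H-statistic:", h_stat)
print("P-Value:", p_value)
```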

If you have your data within a CSV file in separate columns, something like this should work:

import pandas
from scipy.stats import mstats

Data = pandas.read_csv("CSVfile.csv")
Col_1 = Data['Colname1']
Col_2 = Data['Colname2']
Col_3 = Data['Colname3']
Col_4 = Data['Colname4']

print("Kruskal-Wallis H-test:")

H, pval = mstats.kruskalwallis(Col_1, Col_2, Col_3, Col_4)

print("H-statistic:", H)
print("P-Value:", pval)

if pval < 0.05:
    print("Reject NULL hypothesis - Significant differences exist between groups.")
else:
    print("Fail to reject NULL hypothesis - No significant difference detected between groups.")
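Applied to the lists in the question, a sketch might look like the following. It assumes the strings are meant to be numeric break counts and that each list holds all observations for one group, so the test is run once on the two full arrays instead of once per pair of values. The names control_counts and sample_counts are introduced here for illustration.

```python
import numpy as np
from scipy.stats import mstats

controls = ['1', '2', '3', '4', '5']
samples = ['10', '20', '30', '40', '50']

# Convert the strings to floats so the values are ranked as numbers.
control_counts = np.array(controls, dtype=float)
sample_counts = np.array(samples, dtype=float)

# One call, with all observations of each group as a separate argument.
H, pval = mstats.kruskalwallis(control_counts, sample_counts)
print("H-statistic:", H)
print("P-Value:", pval)

# With a single observation per group the ranks can only be 1 and 2,
# so H comes out as 1.0 regardless of the actual values.
H_single, p_single = mstats.kruskalwallis(control_counts[:1], sample_counts[:1])
print("Single-observation H:", H_single)
```

The last call is likely why the second attempt in the question returned the same result for every comparison: the test works on ranks, and two single-value groups always produce the same ranks.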