Conditional mean over a Pandas DataFrame

Oliver G · Jun 27, 2017 · Viewed 20.2k times

I have a dataset, and I want the averages of several variables I created.

I started off with:

data2['socialIdeology2'].mean()

data2['econIdeology'].mean()

^ that works perfectly, and gives me the averages I'm looking for.

Now I'm trying to do a conditional mean, i.e. the mean for only a select group within the data set. (I want the ideologies broken down by whom each respondent voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'

Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.

This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:

def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count

print(mean())

Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.

So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?

And how can I generate means by category?

Answer

Brad Solomon · Jun 27, 2017

Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():

means = data2.groupby('voteChoice').mean()

or maybe, in your case, the following would be more efficient:

means = data2.groupby('voteChoice')['socialIdeology2'].mean()

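to drill down to the mean you're looking for. (The first case calculates means for every column; on pandas 2.0 and later, pass numeric_only=True if the frame also holds non-numeric columns, otherwise the call raises a TypeError.) This assumes that voteChoice is the name of the column you want to condition on.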
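For reference, here is a minimal sketch of the above on a made-up frame. The values in data2 are invented for illustration; only the column names come from the question. It also shows a boolean mask, which is probably the closest match to Stata's `if voteChoice == 'Clinton'`. On the nan puzzle: assuming the column contains missing values, a likely explanation is that .mean() skips NaN by default (skipna=True), whereas a hand-written running sum turns into nan as soon as it adds one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data2; the values are invented for illustration.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Clinton"],
    "socialIdeology2": [2.0, 6.0, np.nan, 4.0, 4.0],
    "econIdeology": [1.0, 5.0, 3.0, 7.0, 2.0],
})

# Mean for a single group, selected with a boolean mask.
is_clinton = data2["voteChoice"] == "Clinton"
clinton_mean = data2.loc[is_clinton, "socialIdeology2"].mean()

# Means for every group at once, restricted to the columns of interest.
group_means = data2.groupby("voteChoice")[["socialIdeology2", "econIdeology"]].mean()

# Series.mean() ignores NaN by default; skipna=False propagates it instead,
# which is what a plain running sum over the values effectively does.
mean_skipping_nan = data2["socialIdeology2"].mean()
mean_with_nan = data2["socialIdeology2"].mean(skipna=False)

print(clinton_mean)
print(group_means)
print(mean_skipping_nan, mean_with_nan)
```

If the goal is one number for one group, the mask is enough; groupby is the better fit when you want the whole breakdown by voteChoice in a single table.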