Calculating variance with large numbers

sharkin picture sharkin · Nov 12, 2009 · Viewed 7.5k times · Source

I haven't really used variance calculation that much, and I don't know quite what to expect. Actually I'm not too good with math at all.

I have a an array of 1000000 random numeric values in the range 0-10000.

The array could grow even larger, so I use 64 bit int for sum.

I have tried to find code on how to calc variance, but I don't know if I get correct output.

The mean is 4692 and median is 4533. I get variance 1483780.469308 using the following code:

// size is the element count, in this case 1000000
// value_sum is __int64

double p2 = pow( (double)(value_sum - (value_sum/size)), (double)2.0 );
double variance = sqrt( (double)(p2 / (size-1)) );

Am I getting a reasonable value?

Is anything wrong with the calculation?

Answer

Jacob picture Jacob · Nov 12, 2009

Note: It doesn't look like you're calculating the variance.

Variance is calculated by subtracting the mean from every element and calculating the weighted sum of these differences.

So what you need to do is:

// Get mean
double mean = static_cast<double>(value_sum)/size;

// Calculate variance
double variance = 0;
for(int i = 0;i<size;++i) 
{
  variance += (MyArray[i]-mean)*(MyArray[i]-mean)/size;
}

// Display
cout<<variance;

Note that this is the sample variance, and is used when the underlying distribution is unknown (so we assume a uniform distribution).

Also, after some digging around, I found that this is not an unbiased estimator. Wolfram Alpha has something to say about this, but as an example, when MATLAB computes the variance, it returns the "bias-corrected sample variance".

The bias-corrected variance can be obtained by dividing by each element by size-1, or:

//Please check that size > 1
variance += (MyArray[i]-mean)*(MyArray[i]-mean)/(size-1); 

Also note that, the value of mean remains the same.