I'm evaluating a number of different algorithms whose job is to predict the probability of an event occurring.
I am testing the algorithms on large-ish datasets. I measure their effectiveness using "Root Mean Squared Error" (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating-point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).
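For concreteness, this is one way to compute that quantity (a minimal Python/NumPy sketch; the function name and array arguments are placeholders, not something specified above):

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error: the square root of the mean of the squared errors."""
    predicted = np.asarray(predicted, dtype=float)  # predicted probabilities in [0, 1]
    actual = np.asarray(actual, dtype=float)        # observed outcomes, 0.0 or 1.0
    return np.sqrt(np.mean((predicted - actual) ** 2))
```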
So I know the RMSE, and also the number of samples that the algorithm was tested on.
The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance or whether it reflects an actual difference in performance.
Ideally, for a given pair of RMSE values, I'd like to know what the probability is that one is really better than the other, so that I can use this probability as a threshold of significance.
The MSE is an average and hence the central limit theorem applies. So testing whether two MSEs are the same is the same as testing whether two means are equal. A difficulty compared to a standard test comparing two means is that your samples are correlated -- both come from the same events. But a difference in MSEs is the same as the mean of the differenced squared errors (means are linear). This suggests calculating a one-sample t-test as follows:
1. For each event x, compute the error e for procedures 1 and 2.
2. Compute the differenced squared errors, d = e2^2 - e1^2.
3. Compute the t-statistic, t = mean(d) / (sd(d) / sqrt(n)).
4. Reject the hypothesis of equal MSE (at the 5% level) if |t| > 1.96.
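A minimal sketch of this test in Python (using NumPy and SciPy; the array names e1 and e2 for the per-event errors are assumptions, not something given above):

```python
import numpy as np
from scipy import stats

def compare_mse(e1, e2):
    """Paired test of equal MSE via a one-sample t-test on e2^2 - e1^2.

    e1, e2: per-event errors (predicted - actual) for procedures 1 and 2,
    evaluated on the same events so that the pairing is meaningful.
    """
    d = np.asarray(e2, dtype=float) ** 2 - np.asarray(e1, dtype=float) ** 2
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # t-statistic on the differences
    p = 2 * stats.t.sf(abs(t), df=n - 1)          # two-sided p-value
    return t, p

# For large n, rejecting when |t| > 1.96 corresponds to a two-sided 5% test.
```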
The RMSE is a monotonic transformation of the MSE, so this test shouldn't give substantively different results. But be careful not to confuse the MSE with the RMSE when reporting or comparing values.
A bigger concern should be overfitting. Make sure to compute all your MSE statistics using data that you did not use to estimate your model.
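For example, one simple way to keep the evaluation data separate (a sketch using scikit-learn's train_test_split; the data, the 80/20 split, and the variable names are all hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: features X and binary outcomes y, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Estimate the model on the training split only; compute the error/MSE
# statistics above on the held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```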