I am currently implementing a custom loss layer, and in the process I stumbled upon the implementation of mean squared error in objectives.py [1]. I know I'm missing something in my understanding of this loss calculation, because I always thought the average was taken separately across the samples for each output in each mini-batch (axis 0 of the tensor), but it appears the average is actually taken across the last axis, which for a single output vector would mean averaging across the outputs. I found this by accident while working on my custom loss layer, because it requires discounting the loss of a few of the outputs if a training output in a specific position has a specific value.

Anyway, is my understanding of mean squared error incorrect? Why would Keras use the last axis, thus turning a 1×n output vector into a 1×1 value?
Thanks.
[1] https://github.com/fchollet/keras/blob/master/keras/objectives.py#L7
The code in question for the MSE loss is this:
    def mean_squared_error(y_true, y_pred):
        return K.mean(K.square(y_pred - y_true), axis=-1)
Here, y_pred and y_true are first subtracted; that result is passed to K.square, which, as expected, returns the elementwise square of its argument; and that result is given to K.mean, which computes the mean along the last axis (axis=-1). For a batch of shape (samples, outputs), this collapses the outputs and returns one loss value per sample.
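Mirroring the computation with NumPy makes the reduction easy to see (the shapes and values here are purely illustrative):

    import numpy as np

    # A mini-batch of 2 samples, each with 3 outputs: shape (2, 3)
    y_true = np.array([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])
    y_pred = np.array([[1.5, 2.0, 2.0],
                       [1.0, 0.0, 0.0]])

    # Averaging over the last axis collapses the outputs,
    # leaving one loss value per sample: shape (2,)
    per_sample_mse = np.mean(np.square(y_pred - y_true), axis=-1)
    print(per_sample_mse)  # [0.41666667 0.33333333]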
So the code is clearly doing what it's supposed to do. As for why the last axis is operated on: this has nothing to do with classes, it is just a convention; note that in general there are no classes in the MSE definition. Reducing over the last axis leaves one loss value per sample, and Keras then averages those per-sample values over the batch dimension (axis 0) during training, optionally weighted by sample_weight. So the average across samples you expected does still happen, just at a later step.
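This convention also works in your favor for the custom loss you describe: you can zero out the contribution of particular outputs inside the loss and still reduce over the last axis, so Keras continues to receive one value per sample. A minimal sketch, assuming a hypothetical convention where a true value of -1.0 marks an output to be ignored (the sentinel and the name masked_mse are my own, not part of Keras):

    from keras import backend as K

    def masked_mse(y_true, y_pred):
        # Hypothetical convention: outputs whose true value is -1.0
        # should not contribute to the loss.
        mask = K.cast(K.not_equal(y_true, -1.0), K.floatx())
        squared = K.square(y_pred - y_true) * mask
        # Reduce over the outputs, as the built-in losses do.
        return K.mean(squared, axis=-1)

Note that this still divides by the total number of outputs; if you want the mean over only the unmasked outputs, divide K.sum(squared, axis=-1) by K.sum(mask, axis=-1) instead, guarding against division by zero.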