Binary classification with Softmax

AKSHAYAA VAIDYANATHAN · Aug 21, 2017 · Viewed 18k times

I am training a binary classifier using the sigmoid activation function with binary_crossentropy, which gives good accuracy, around 98%.
When I train the same model using softmax with categorical_crossentropy, I get very low accuracy (< 40%).
I am passing the targets for binary_crossentropy as a list of 0s and 1s, e.g. [0,1,1,1,0].

Any idea why this is happening?

This is the model I am using for the second classifier: [model definition posted as an image, not reproduced here]

Answer

Yohan Grember · Aug 21, 2017

Right now, your second model always answers "Class 0", because it can only ever choose one class: your last layer has a single output.

As you have two classes, you need to compute softmax + categorical_crossentropy over two outputs so the model can pick the more probable one.

Hence, your last layer should be:

model.add(Dense(2, activation='softmax'))
model.compile(...)
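As a minimal sketch of the whole two-output setup (the hidden layer and input_dim below are hypothetical, since the actual model was posted as an image), note that categorical_crossentropy also expects one-hot targets rather than a flat list of 0s and 1s, which keras.utils.to_categorical can produce:

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))  # hypothetical hidden layer
model.add(Dense(2, activation='softmax'))              # one output per class
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# one-hot encode flat 0/1 targets for categorical_crossentropy:
y = [0, 1, 1, 1, 0]
y_onehot = to_categorical(y, num_classes=2)  # [[1,0],[0,1],[0,1],[0,1],[1,0]]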

Your sigmoid + binary_crossentropy model, which computes the probability of "Class 0" from a single output number, is already correct.
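For comparison, a minimal sketch of that single-output setup (same hypothetical architecture as above):

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))  # hypothetical hidden layer
model.add(Dense(1, activation='sigmoid'))              # single output: P("Class 0")
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# flat 0/1 targets such as [0, 1, 1, 1, 0] work directly here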

EDIT: Here is a short explanation of the sigmoid function

The sigmoid can be viewed as a mapping from the space of real numbers to a probability space.

[plot of the sigmoid function, Sigmoid(x) = 1 / (1 + exp(-x))]

Notice that:

Sigmoid(-infinity) = 0   
Sigmoid(0) = 0.5   
Sigmoid(+infinity) = 1   

So if the real number output by your network is very low, the sigmoid will decide that the probability of "Class 0" is close to 0, and the model will predict "Class 1".
Conversely, if the output of your network is very high, the sigmoid will decide that the probability of "Class 0" is close to 1, and the model will predict "Class 0".
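A quick numeric check of this behaviour, using a plain NumPy sigmoid (a sketch, not Keras code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-10.0))  # ~0.000045 -> P("Class 0") near 0, predict "Class 1"
print(sigmoid(0.0))    # 0.5       -> maximally uncertain
print(sigmoid(10.0))   # ~0.999955 -> P("Class 0") near 1, predict "Class 0"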

Its decision is similar to deciding the class only by looking at the sign of your output. However, that rule alone would not allow your model to learn! Indeed, the gradient of this binary 0/1 loss is null nearly everywhere, making it impossible for your model to learn from its error, as the error is not quantified properly.

That's why sigmoid and binary_crossentropy are used:
binary cross-entropy is a surrogate for the binary 0/1 loss with nice smooth properties, which enables learning.
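To make the contrast concrete, here is a small sketch comparing the two losses on a single example with true label 1; the 0/1 "sign" loss is flat almost everywhere, while binary cross-entropy changes smoothly with the predicted probability:

import numpy as np

def zero_one_loss(p, y):
    return float((p >= 0.5) != y)  # step function: gradient is 0 almost everywhere

def binary_crossentropy(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))  # smooth in p

for p in (0.1, 0.5, 0.9):
    print(p, zero_one_loss(p, 1), round(binary_crossentropy(p, 1), 3))
# 0.1 -> 0/1 loss 1.0, BCE 2.303
# 0.5 -> 0/1 loss 0.0, BCE 0.693  (0/1 loss is already "satisfied"; BCE keeps improving)
# 0.9 -> 0/1 loss 0.0, BCE 0.105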

For more background, see the references on the Softmax Function and Cross Entropy.
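As an aside (a sketch, not part of the original answer): with exactly two classes, softmax over two logits reduces to a sigmoid of their difference, which is why both setups can express the same classifier once the output layer is sized correctly:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0])  # two hypothetical class logits
print(softmax(z)[0])                          # ~0.9526
print(1.0 / (1.0 + np.exp(-(z[0] - z[1]))))   # sigmoid(z0 - z1): same value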