Keras Binary Classification - Sigmoid activation function

Daniel Whettam · Mar 6, 2018 · Viewed 12.4k times

I've implemented a basic MLP in Keras with TensorFlow and I'm trying to solve a binary classification problem. For binary classification, sigmoid seems to be the recommended activation function, and I don't quite understand why, or how Keras deals with this.

I understand that the sigmoid function produces values in the range 0 to 1. My understanding is that for classification problems using sigmoid, there is some threshold used to determine the class of an input (typically 0.5). In Keras I don't see any way to specify this threshold, so I assume it's done implicitly in the back-end? If that's the case, how does Keras distinguish between sigmoid being used for a binary classification problem and for a regression problem? With binary classification we want a binary value, but with regression we need a continuous value. The only thing I can see that could be indicating this is the loss function. Is that what tells Keras how to handle the data?

Additionally, assuming Keras is implicitly applying a threshold, why does it output continuous values when I use my model to predict on new data?

For example:

y_pred = model.predict(x_test)
print(y_pred)

gives:

[7.4706882e-02] [8.3481872e-01] [2.9314638e-04] [5.2297767e-03] [2.1608515e-01] ... [4.4894204e-03] [5.1120580e-05] [7.0263929e-04]

I can apply a threshold myself when predicting to get a binary output, but surely Keras must be doing that anyway in order to classify correctly? Perhaps Keras applies a threshold when training the model, but not when I use it to predict new values, since the loss function isn't used in prediction? Or is it not applying a threshold at all, and the continuous values it outputs just happen to work well with my model? I've checked that the same thing happens with the Keras example for binary classification, so I don't think I've made any errors in my code, especially as it's predicting accurately.
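
For instance, this is the kind of manual thresholding I mean (just a minimal sketch, assuming the model below has already been trained and NumPy is available):

import numpy as np

y_prob = model.predict(x_test)          # raw sigmoid outputs in (0, 1)
y_class = (y_prob > 0.5).astype(int)    # apply the 0.5 threshold myself
print(y_class[:5])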

If anyone could explain how this is working, I would greatly appreciate it.

Here's my model as a point of reference:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(124, activation='relu', input_shape=(2,)))
model.add(Dropout(0.5))
model.add(Dense(124, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))  # single unit: outputs a probability in (0, 1)
model.summary()

model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr = 0.1, momentum = 0.003),
              metrics=['acc'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)

Answer

Maxim Egorushkin · Mar 6, 2018

The output of a binary classifier is the probability of the sample belonging to the positive class.
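
The sigmoid in the last layer is what squashes the raw network output into (0, 1), which is why it can be read as a probability. A minimal sketch in plain NumPy (not the actual Keras implementation):

import numpy as np

def sigmoid(z):
    # maps any real-valued z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ~[0.018, 0.5, 0.982]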

how does Keras distinguish between sigmoid being used for a binary classification problem and for a regression problem?

It does not need to. It uses the loss function to calculate the loss, then computes the derivatives and updates the weights.

In other words:

  • During training the framework minimizes the loss. The user must specify the loss function (one provided by the framework) or supply their own. The network only cares about the scalar value this function outputs; its two arguments are the predicted ŷ and the actual y (see the sketch after this list).
  • Each activation function implements the forward-propagation and back-propagation functions. The framework is only interested in these two functions; it does not care what the function does exactly. The only requirement is that the activation function is non-linear.
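
To make this concrete, here is a minimal sketch in plain NumPy (not the actual Keras code) of the two quantities involved. No threshold appears anywhere in the loss, which is computed directly on the raw sigmoid outputs; a 0.5 cut-off only shows up in the accuracy metric (with binary_crossentropy, the 'acc' metric resolves to binary_accuracy, which uses a 0.5 threshold by default) and in whatever post-processing you apply yourself:

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # the loss works on probabilities directly -- no threshold anywhere
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def binary_accuracy(y_true, y_pred, threshold=0.5):
    # the only place a threshold appears: the reporting metric
    return np.mean((y_pred > threshold).astype(y_true.dtype) == y_true)

y_true = np.array([1., 0., 1., 0.])
y_pred = np.array([0.83, 0.07, 0.52, 0.21])      # raw sigmoid outputs
print(binary_crossentropy(y_true, y_pred))       # drives the weight updates
print(binary_accuracy(y_true, y_pred))           # only used for reporting

So model.predict returns the probabilities untouched; picking a cut-off is up to you (or the metric) whenever you need hard class labels.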