Gradient descent stochastic update - Stopping criterion and update rule - Machine Learning

code muncher · Jan 22, 2013 · Viewed 7.2k times

My dataset has m features and n data points. Let w be the weight vector (to be estimated). I'm trying to implement gradient descent with the stochastic update method. The function I'm minimizing is the least mean squares (LMS) error.

The update algorithm is shown below:

for i = 1 ... n data points:
    for t = 1 ... m features:
        w_t = w_t - alpha * (<w>.<x_i> - y_i) * x_{i,t}

where <x_i> is a row vector of m features, <y> is a column vector of true labels (y_i is its i-th entry), and alpha is a constant learning rate.
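For concreteness, here is a minimal NumPy sketch of that loop; the function name, zero initialization, and default alpha are my own illustrative choices, and the update is written exactly as above (i.e., without the factor of 2 discussed in the answer below):

    import numpy as np

    def sgd_lms(X, y, alpha=0.01):
        """One stochastic-update pass over the data for least squares.

        X     : (n, m) array, row i is the feature vector <x_i>
        y     : (n,) array of true labels
        alpha : constant learning rate
        """
        n, m = X.shape
        w = np.zeros(m)                  # the vector to be estimated
        for i in range(n):               # for each data point
            error = X[i].dot(w) - y[i]   # <w>.<x_i> - y_i, computed once per point
            for t in range(m):           # for each feature
                w[t] -= alpha * error * X[i, t]
        return w

The inner loop over the m features is equivalent to the single vectorized update w -= alpha * error * X[i].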

My questions:

  • Now, according to the wiki article, I don't need to go through all the data points, and I can stop when the error is small enough. Is that true?

  • I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.

  • Is this formula - which I used in the for loop - correct? I believe (<w>.<x_i> - y_i) * x_{i,t} is my ∇Q_i(w).

Answer

Ammar · Jan 23, 2013

Now, according to the wiki article, I don't need to go through all the data points, and I can stop when the error is small enough. Is that true?

This is especially true when you have a really huge training set and going through all the data points is expensive. In that case, you would check the convergence criterion after every K stochastic updates (i.e., after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomize the order in which training examples are processed, to avoid having too many correlated examples in a row, which may result in "fake" convergence.
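A sketch of that schedule, assuming an illustrative squared_error helper and a relative-change test (all names and thresholds here are mine, not from the original answer):

    import numpy as np

    def squared_error(w, X, y):
        residuals = X.dot(w) - y
        return residuals.dot(residuals)

    def sgd_periodic_check(X, y, alpha=0.01, K=1000, pct=0.01, max_epochs=50):
        n, m = X.shape
        w = np.zeros(m)
        prev_err = squared_error(w, X, y)
        for epoch in range(max_epochs):
            # shuffle each epoch to avoid correlated examples in a row
            for count, i in enumerate(np.random.permutation(n), 1):
                w -= alpha * (X[i].dot(w) - y[i]) * X[i]   # one stochastic update
                if count % K == 0:                          # check after K updates
                    err = squared_error(w, X, y)
                    if abs(prev_err - err) < pct * prev_err:
                        return w                            # converged
                    prev_err = err
        return w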

I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.

There are a few options. I recommend trying as many of them as you can and deciding based on empirical results.

  1. The difference in the objective function on the training data is smaller than a threshold.
  2. The difference in the objective function on held-out data (a.k.a. development data or validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e., for the stochastic updates), nor any of the examples in the test set used for evaluation.
  3. The total absolute difference in the parameters w is smaller than a threshold.
  4. In 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w) (see the sketch after this list).
  5. Sometimes we don't care whether we have the optimal parameters; we just want to improve the parameters we started with. In that case, it's reasonable to preset a number of iterations over the training data and stop after that, regardless of whether the objective function has actually converged.
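As an illustration of options 3 and 4, the two tests might look like this as small predicates (the names and the 1% default are mine):

    import numpy as np

    def params_converged(w, previous_w, threshold=1e-4):
        # Option 3: total absolute change in the parameter vector
        return np.sum(np.abs(w - previous_w)) < threshold

    def error_converged(err, prev_err, pct=0.01):
        # Option 4: stop when the squared error changes by less than
        # a percentage (here 1%) of its previous value
        return abs(err - prev_err) < pct * prev_err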

Is this formula - which I used in the for loop - correct? I believe (<w>.<x_i> - y_i) * x_{i,t} is my ∇Q_i(w).

It should be 2 * (<w>.<x_i> - y_i) * x_{i,t}, but it's not a big deal, given that you're multiplying by the learning rate alpha anyway: the constant factor of 2 can simply be absorbed into alpha.
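For reference, the 2 comes from differentiating the per-example squared error (standard calculus, not part of the original answer):

    Q_i(w) = (w \cdot x_i - y_i)^2
    \frac{\partial Q_i}{\partial w_t} = 2 \, (w \cdot x_i - y_i) \, x_{i,t}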