Predicting the next word using the LSTM ptb model tensorflow example

rebeccaroisin picture rebeccaroisin · Mar 29, 2016 · Viewed 8.2k times · Source

I am trying to use the tensorflow LSTM model to make next word predictions.

As described in this related question (which has no accepted answer) the example contains pseudocode to extract next word probabilities:

lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])

loss = 0.0
for current_batch_of_words in words_in_dataset:
  # The value of state is updated after processing each batch of words.
  output, state = lstm(current_batch_of_words, state)

  # The LSTM output can be used to make next word predictions
  logits = tf.matmul(output, softmax_w) + softmax_b
  probabilities = tf.nn.softmax(logits)
  loss += loss_function(probabilities, target_words)

I am confused about how to interpret the probabilities vector. I modified the __init__ function of the PTBModel in ptb_word_lm.py to store the probabilities and logits:

class PTBModel(object):
  """The PTB model."""

  def __init__(self, is_training, config):
    # General definition of LSTM (unrolled)
    # identical to tensorflow example ...     
    # omitted for brevity ...


    # computing the logits (also from example code)
    logits = tf.nn.xw_plus_b(output,
                             tf.get_variable("softmax_w", [size, vocab_size]),
                             tf.get_variable("softmax_b", [vocab_size]))
    loss = seq2seq.sequence_loss_by_example([logits],
                                            [tf.reshape(self._targets, [-1])],
                                            [tf.ones([batch_size * num_steps])],
                                            vocab_size)
    self._cost = cost = tf.reduce_sum(loss) / batch_size
    self._final_state = states[-1]

    # my addition: storing the probabilities and logits
    self.probabilities = tf.nn.softmax(logits)
    self.logits = logits

    # more model definition ...

Then printed some info about them in the run_epoch function:

def run_epoch(session, m, data, eval_op, verbose=True):
  """Runs the model on the given data."""
  # first part of function unchanged from example

  for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
                                                    m.num_steps)):
    # evaluate proobability and logit tensors too:
    cost, state, probs, logits, _ = session.run([m.cost, m.final_state, m.probabilities, m.logits, eval_op],
                                 {m.input_data: x,
                                  m.targets: y,
                                  m.initial_state: state})
    costs += cost
    iters += m.num_steps

    if verbose and step % (epoch_size // 10) == 10:
      print("%.3f perplexity: %.3f speed: %.0f wps, n_iters: %s" %
            (step * 1.0 / epoch_size, np.exp(costs / iters),
             iters * m.batch_size / (time.time() - start_time), iters))
      chosen_word = np.argmax(probs, 1)
      print("Probabilities shape: %s, Logits shape: %s" % 
            (probs.shape, logits.shape) )
      print(chosen_word)
      print("Batch size: %s, Num steps: %s" % (m.batch_size, m.num_steps))

  return np.exp(costs / iters)

This produces output like this:

0.000 perplexity: 741.577 speed: 230 wps, n_iters: 220
(20, 10000) (20, 10000)
[ 14   1   6 589   1   5   0  87   6   5   3   5   2   2   2   2   6   2  6   1]
Batch size: 1, Num steps: 20

I was expecting the probs vector to be an array of probabilities, with one for each word in the vocabulary (eg with shape (1, vocab_size)), meaning that I could get the predicted word using np.argmax(probs, 1) as suggested in the other question.

However, the first dimension of the vector is actually equal to the number of steps in the unrolled LSTM (20 if the small config settings are used), which I'm not sure what to do with. To access to the predicted word, do I just need to use the last value (because it's the output of the final step)? Or is there something else that I'm missing?

I tried to understand how the predictions are made and evaluated by looking at the implementation of seq2seq.sequence_loss_by_example, which must perform this evaluation, but this ends up calling gen_nn_ops._sparse_softmax_cross_entropy_with_logits, which doesn't seem to be included in the github repo, so I'm not sure where else to look.

I'm quite new to both tensorflow and LSTMs, so any help is appreciated!

Answer

mrry picture mrry · Mar 29, 2016

The output tensor contains the concatentation of the LSTM cell outputs for each timestep (see its definition here). Therefore you can find the prediction for the next word by taking chosen_word[-1] (or chosen_word[sequence_length - 1] if the sequence has been padded to match the unrolled LSTM).

The tf.nn.sparse_softmax_cross_entropy_with_logits() op is documented in the public API under a different name. For technical reasons, it calls a generated wrapper function that does not appear in the GitHub repository. The implementation of the op is in C++, here.