Use LSTM tutorial code to predict next word in a sentence?

Darren Cook · Sep 8, 2017 · Viewed 7.6k times

I've been trying to understand the sample code from https://www.tensorflow.org/tutorials/recurrent, which you can find at https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py

(Using tensorflow 1.3.0.)

I've summarized (what I think are) the key parts, for my question, below:

size = 200
vocab_size = 10000
layers = 2
# input_.input_data is a 2D tensor [batch_size, num_steps] of
#    word ids, from 0 to 9999

cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(size) for _ in range(layers)]
)

embedding = tf.get_variable(
    "embedding", [vocab_size, size], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

inputs = tf.unstack(inputs, num=num_steps, axis=1)
outputs, state = tf.contrib.rnn.static_rnn(
    cell, inputs, initial_state=self._initial_state)

output = tf.reshape(tf.stack(axis=1, values=outputs), [-1, size])
softmax_w = tf.get_variable(
    "softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b

# Then calculate loss, do gradient descent, etc.
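
For context, the tutorial then computes a sequence loss over these logits; roughly like this (paraphrased from memory rather than copied verbatim, and using the simplified names from my summary above instead of the real class attributes):

    # Reshape logits back to [batch_size, num_steps, vocab_size] and compare
    # against input_.targets, which holds the next-word ids
    logits = tf.reshape(logits, [batch_size, num_steps, vocab_size])
    loss = tf.contrib.seq2seq.sequence_loss(
        logits,
        input_.targets,
        tf.ones([batch_size, num_steps], dtype=data_type()),
        average_across_timesteps=False,
        average_across_batch=True)
    cost = tf.reduce_sum(loss)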

My biggest question is: how do I use the produced model to actually generate a next-word suggestion, given the first few words of a sentence? Concretely, I imagine the flow is like this, but I cannot get my head around what the code for the commented lines would be:

prefix = ["What", "is", "your"]
state = #Zeroes
# Call static_rnn(cell) once for each word in prefix to initialize state
# Use final output to set a string, next_word
print(next_word)

My sub-questions are:

  • Why use a random (uninitialized, untrained) word-embedding?
  • Why use softmax?
  • Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)?
  • How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

(I'm asking them all as one question, as I suspect they are all connected, and connected to some gap in my understanding.)

What I was expecting to see here was: load an existing set of word2vec embeddings (e.g. using gensim's KeyedVectors.load_word2vec_format()), convert each word in the input corpus to that representation when loading each sentence, then have the LSTM spit out a vector of the same dimension, and finally find the most similar word (e.g. using gensim's similar_by_vector(y, topn=1)).
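
For concreteness, a minimal sketch of the flow I had in mind. The file name vectors.bin and the my_lstm_step() helper are placeholders of mine, not anything from the tutorial:

    from gensim.models import KeyedVectors

    # Placeholder path: any pre-trained word2vec file in the binary format
    kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    prefix = ["What", "is", "your"]
    vectors = [kv[w] for w in prefix]   # one vector per word (may need lower-casing,
                                        # depending on the model's vocabulary)

    y = my_lstm_step(vectors)           # hypothetical: the LSTM's output vector,
                                        # same dimension as the embeddings
    print(kv.similar_by_vector(y, topn=1))   # nearest word to that output vector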

Is using softmax saving us from the relatively slow similar_by_vector(y, topn=1) call?


BTW, for the pre-trained word2vec part of my question, Using pre-trained word2vec with LSTM for word generation is a similar question. However, the answers there, currently, are not what I'm looking for. What I'm hoping for is a plain English explanation that switches the light on for me, and plugs whatever the gap in my understanding is. Use pre-trained word2vec in lstm language model? is another similar question.

UPDATE: Predicting next word using the language model tensorflow example and Predicting the next word using the LSTM ptb model tensorflow example are similar questions. However, neither shows the code to actually take the first few words of a sentence, and print out its prediction of the next word. I tried pasting in code from the 2nd question, and from https://stackoverflow.com/a/39282697/841830 (which comes with a github branch), but cannot get either to run without errors. I think they may be for an earlier version of TensorFlow?

ANOTHER UPDATE: Yet another question asking basically the same thing: Predicting Next Word of LSTM Model from Tensorflow Example. It links to Predicting next word using the language model tensorflow example (and, again, the answers there are not quite what I am looking for).

In case it still isn't clear, what I am trying to write is a high-level function called getNextWord(model, sentencePrefix), where model is a previously built LSTM that I've loaded from disk, and sentencePrefix is a string such as "Open the"; it might then return "pod". I could then call it with "Open the pod" and it would return "bay", and so on.

An example (with a character RNN, and using mxnet) is the sample() function shown near the end of https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter05_recurrent-neural-networks/simple-rnn.ipynb. You can call sample() during training, but you can also call it after training, with any sentence you want.

Answer

c2huc2hu · Sep 11, 2017

Main Question

Loading words

Load custom data instead of using the test set:

reader.py@ptb_raw_data

test_path = os.path.join(data_path, "ptb.test.txt")
test_data = _file_to_word_ids(test_path, word_to_id)  # change this line

test_data should contain word ids (print out word_to_id for a mapping). As an example, it should look like: [1, 52, 562, 246] ...
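
As a concrete sketch of that step (assuming word_to_id is the dict built by reader.py's _build_vocab, and that "<unk>" is the unknown-word token in the PTB data):

    prefix = "what is your".split()
    unk_id = word_to_id.get("<unk>")                          # fallback for unknown words
    test_data = [word_to_id.get(w, unk_id) for w in prefix]
    print(test_data)   # a short list of ids, analogous to the example above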

Displaying predictions

We need to return the output of the FC layer (logits) in the call to sess.run.

ptb_word_lm.py@PTBModel.__init__

    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self.top_word_id = tf.argmax(logits, axis=2)  # add this line

ptb_word_lm.py@run_epoch

  fetches = {
      "cost": model.cost,
      "final_state": model.final_state,
      "top_word_id": model.top_word_id # add this line
  }

Later in the function, vals['top_word_id'] will have an array of integers with the ID of the top word. Look this up in the reverse of word_to_id (i.e. an id-to-word mapping) to determine the predicted word. I did this a while ago with the small model, and the top-1 accuracy was pretty low (20-30% IIRC), even though the perplexity was what was predicted in the header.
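
A minimal sketch of that lookup, assuming you also expose word_to_id from reader.py's ptb_raw_data (it is built there via _build_vocab but not returned by default):

    # Invert the word -> id dict once, then map each predicted id back to a word
    id_to_word = {v: k for k, v in word_to_id.items()}
    predicted = [[id_to_word[i] for i in row] for row in vals["top_word_id"]]
    print(predicted[0])   # the predicted next word at each step of the first sequence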

Subquestions

Why use a random (uninitialized, untrained) word-embedding?

You'd have to ask the authors, but in my opinion, training the embeddings makes this more of a standalone tutorial: instead of treating the embedding as a black box, it shows how it works.

Why use softmax?

The final prediction is not determined by the cosine similarity to the output of the hidden layer. Instead, there is an FC layer after the LSTM that converts the hidden state into a score (logit) for every word in the vocabulary, and softmax turns those scores into a probability distribution over the next word.

Here's a sketch of the operations and dimensions in the neural net:

word -> one-hot vector (1 x vocab_size) -> embedding (1 x hidden_size) -> LSTM -> FC layer (1 x vocab_size) -> softmax (1 x vocab_size)
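
Tying that back to the variable names in the question's summary, the last two arrows are roughly (a sketch, not a quote from the tutorial):

    # output: [batch_size * num_steps, size], as in the question's summary
    logits = tf.matmul(output, softmax_w) + softmax_b   # [batch*steps, vocab_size]
    probs = tf.nn.softmax(logits)                       # probability of each vocabulary word
    next_word_id = tf.argmax(logits, axis=1)            # id of the most likely next word

So rather than a (relatively slow) nearest-neighbour search such as similar_by_vector, the prediction is just an argmax over the vocabulary scores.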

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

Technically, no. If you look at the LSTM equations, you'll notice that x (the input) can be any size, as long as the weight matrix is adjusted appropriately.

(image: the LSTM equations)
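
For reference, one standard formulation of those equations (the usual textbook form, not copied from the image above). The only place the input dimension appears is in the shapes of the W matrices, which are hidden_size x input_size:

    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \tanh(c_t)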

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

I don't know, sorry.