Understanding parameters in Gensim LDA Model

Jane Sully · Jun 11, 2018

I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these parameters signify. Specifically, I do not understand:

  • random_state
  • update_every
  • chunksize
  • passes
  • alpha
  • per_word_topics

I am working with a corpus of 500 documents of roughly 3-5 pages each (unfortunately, I cannot share a sample of the data for confidentiality reasons). Currently I have set

  • num_topics = 10
  • random_state = 100
  • update_every = 1
  • chunksize = 50
  • passes = 10
  • alpha = 'auto'
  • per_word_topics = True

but this is based solely on an example I saw, and I am not sure how well it generalizes to my data.

Answer

sophros · Jun 12, 2018

I wonder if you have seen this page?

Either way, let me explain a few things. Your corpus is small for this method (it works much better when trained on a data source the size of Wikipedia), so the results will be rather crude and you have to be aware of that. For the same reason you should not aim for a large number of topics: you chose 10, which in your case could sensibly go up to maybe 20.

As for the other parameters:

  • random_state - this serves as a seed (in case you wanted to repeat exactly the training process)

  • chunksize - number of documents to consider at once (affects the memory consumption)

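  • update_every - the model is updated after every update_every chunks of chunksize documents; 0 means batch learning (a single update per pass over the corpus), while 1 or more means online learning, where the model is updated as the chunks are processed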

  • passes - how many times the algorithm is supposed to pass over the whole corpus

  • alpha - to cite the documentation:

    can be set to an explicit array = prior of your choice. It also support special values of ‘asymmetric’ and ‘auto’: the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

  • per_word_topics - setting this to True additionally makes the model return, for each word in a document, the topics that word most likely belongs to, sorted from most to least likely. Topics that are not indicative enough for a word are omitted from its list. minimum_phi_value is the related parameter that steers this: it is the threshold a topic's phi value for a word must reach for the topic to be treated as indicative.

Optimal training process parameters are described particularly well in M. Hoffman et al., Online Learning for Latent Dirichlet Allocation.
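As a rough illustration of how chunksize, update_every and passes interact in that online scheme, the sketch below counts how many model updates the settings from the question would produce on 500 documents. The helper `updates_per_pass` is hypothetical (it is not part of gensim), and it assumes a single worker and that leftover chunks at the end of a pass still trigger one final update.

```python
import math


def updates_per_pass(num_docs, chunksize, update_every):
    """Count model updates in one pass over the corpus (single worker assumed)."""
    num_chunks = math.ceil(num_docs / chunksize)
    if update_every == 0:
        # Batch learning: the whole pass is accumulated into one update.
        return 1
    # Online learning: one update per `update_every` chunks, plus a final
    # update for any leftover chunks at the end of the pass.
    return math.ceil(num_chunks / update_every)


# Values from the question: 500 documents, chunksize=50, update_every=1, passes=10.
num_docs, chunksize, update_every, passes = 500, 50, 1, 10

per_pass = updates_per_pass(num_docs, chunksize, update_every)
total = per_pass * passes
print(per_pass, total)  # prints: 10 100
```

Under these assumptions, a larger chunksize or update_every means fewer but larger updates, and passes multiplies the total.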

For memory optimization of the training process or the model see this blog post.