How to use TaggedDocument in gensim?

Farhood · Jul 16, 2017 · Viewed 18k times

I have two directories of text files that I want to read and label, but I don't know how to do this with TaggedDocument. I thought it would work as TaggedDocument([Strings], [Labels]), but apparently it doesn't.

This is my code:

from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize
CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath ="./Sample/"
docs = []
tags = []
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]
for doc in CogList:
     str = open(CogPath+doc,'r').read().decode("utf-8")
     docs.append(str)
     print docs
     tags.append(doc)
     print "###########"
     print tags
     print "!!!!!!!!!!!"
for doc in NotCogList:
     str = open(NotCogPath+doc,'r').read().decode("utf-8")
     docs.append(str)
     tags.append(doc)
for doc in SampleList:
     str = open(SamplePath + doc, 'r').read().decode("utf-8")
     docs.append(str)
     tags.append(doc)

T = TaggedDocument(docs,tags)

model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

This is the error I get:

Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'

Answer

Farhood · Jul 16, 2017

So I experimented a bit and found this definition in the gensim source on GitHub:

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from Word2Vec.
    """
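
The definition above also seems to explain the AttributeError. Here is a minimal sketch, assuming (as the traceback suggests) that Doc2Vec loops over whatever it is given and reads `.words` from each item. The namedtuple is only a stand-in with the same fields as the quoted class, not gensim itself, and the file names and texts are made up for illustration:

```python
from collections import namedtuple

# Stand-in with the same fields as the gensim class quoted above.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# The call from the question: all texts in `words`, all file names in `tags`.
single = TaggedDocument(['text of a.txt', 'text of b.txt'], ['a.txt', 'b.txt'])

# Iterating a namedtuple yields its fields, so a consumer looping over
# `single` sees two plain lists, and a list has no `.words` attribute.
items = list(single)
has_words = [hasattr(item, 'words') for item in items]
print(has_words)  # [False, False]

# One TaggedDocument per text: every item exposes `.words` and `.tags`.
corpus = [
    TaggedDocument('text of a.txt'.split(), ['a.txt']),
    TaggedDocument('text of b.txt'.split(), ['b.txt']),
]
print([hasattr(item, 'words') for item in corpus])  # [True, True]
```

So the model most likely wants an iterable of TaggedDocument objects, not one TaggedDocument holding everything.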

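So I changed how I use TaggedDocument: instead of passing all the documents in one call, I create a separate TaggedDocument instance for each document. The words must be a list of tokens, and the important thing is that the tags must also be passed as a list, even if there is only one tag.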

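import io

for doc in CogList:
    # io.open decodes to unicode on both Python 2 and Python 3
    with io.open(CogPath + doc, 'r', encoding='utf-8') as f:
        text = f.read()
    # words: list of tokens; tags: list holding the file name
    T = TaggedDocument(text.split(), [doc])
    docs.append(T)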
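
Since the same loop is needed for each directory, it can be wrapped in a function. This is only a sketch under a few assumptions: `load_tagged_docs` is a hypothetical helper name, tokenization is a plain whitespace split, and the fallback namedtuple (same fields as the quoted class) is there only so the example runs even where gensim is not installed. The demo files are invented.

```python
import io
import os
import tempfile
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback with the same fields as the quoted gensim class.
    TaggedDocument = namedtuple('TaggedDocument', 'words tags')


def load_tagged_docs(directory):
    """Return one TaggedDocument per .txt file in `directory` (hypothetical helper)."""
    tagged = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.txt'):
            continue
        path = os.path.join(directory, name)
        with io.open(path, 'r', encoding='utf-8') as f:
            words = f.read().split()  # naive whitespace tokenization
        # The tag is the file name, wrapped in a list.
        tagged.append(TaggedDocument(words, [name]))
    return tagged


# Usage on a throwaway directory with two made-up files.
demo_dir = tempfile.mkdtemp()
for name, text in [('a.txt', u'the cat sat'), ('b.txt', u'dogs bark loudly')]:
    with io.open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
        f.write(text)

corpus = load_tagged_docs(demo_dir)
print(corpus[0].words, corpus[0].tags)
```

With the real directories, the lists from CogPath, NotCogPath and SamplePath can be concatenated and passed to models.Doc2Vec in place of T. Note that newer gensim releases name the `size` argument `vector_size`.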