I am using Python 2.7, NLTK 3.2.1 and python-crfsuite 0.8.4. I am following this page for the nltk.tag.crf module: http://www.nltk.org/api/nltk.tag.html?highlight=stanford#nltk.tag.stanford.NERTagger
To start with, I just ran this:
from nltk.tag import CRFTagger
ct = CRFTagger()
train_data = [[('dfd','dfd')]]
ct.train(train_data,"abc")
I tried this too:
f = open("abc","wb")
ct.train(train_data,f)
but I am getting the following error:
File "C:\Python27\lib\site-packages\nltk\tag\crf.py", line 129, in <genexpr>
if all (unicodedata.category(x) in punc_cat for x in token):
TypeError: must be unicode, not str
In Python 2, regular quotes '...' or "..." create byte strings (str), which unicodedata.category() rejects with the TypeError shown in the traceback. To get Unicode strings, use a u prefix before the string, like u'dfd'.
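As a minimal sketch of that fix, the snippet below rebuilds the training data from the question with u-prefixed literals and repeats the check from the traceback on it. The punc_cat set here is a stand-in I am assuming for the one in nltk/tag/crf.py, and the CRFTagger call itself is left out so the snippet runs without NLTK installed.

```python
import unicodedata

# Unicode literals: the u prefix is required on Python 2 and accepted on Python 3.3+.
train_data = [[(u'dfd', u'dfd')]]

# Assumed stand-in for the punctuation categories used in nltk/tag/crf.py.
punc_cat = {u'Pc', u'Pd', u'Ps', u'Pe', u'Pi', u'Pf', u'Po'}

token = train_data[0][0][0]

# Same shape as the failing line in the traceback. With a Unicode token,
# unicodedata.category() gets the type it expects and no TypeError is raised.
is_punct = all(unicodedata.category(x) in punc_cat for x in token)
print(is_punct)
```

With the data in this form, ct.train(train_data, "abc") from the question should get past the line that raised the error.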
To read from a file, you'll want to specify an encoding. See "Backporting Python 3 open(encoding="utf-8") to Python 2" for options; most straightforwardly, replace open() with io.open().
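A rough illustration of the io.open() route, assuming a tab-separated file with one token and tag per line; the file name, the layout, and the UTF-8 encoding are all hypothetical choices made for this example, not something the question specifies.

```python
import io
import os
import tempfile

# Hypothetical sample file, written here only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'tokens.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'dfd\tNOUN\n')

# io.open() decodes while reading, so every line is already a Unicode string
# on both Python 2 and Python 3.
with io.open(path, 'r', encoding='utf-8') as f:
    rows = [tuple(line.rstrip(u'\n').split(u'\t')) for line in f]

# One sentence made of (token, tag) pairs, the shape used in the question.
train_data = [rows]
print(train_data)
```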
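To convert an existing byte string, you can use the unicode() built-in; usually, though, you'll want to call decode() and supply an encoding, because unicode() without one falls back to the default encoding (ASCII) and fails on non-ASCII bytes.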
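For data that already exists as byte strings, a sketch of the decode() approach could look like the following. UTF-8 is an assumption about how the bytes were encoded; substitute whatever encoding your data actually uses.

```python
# Byte strings, as produced by plain '...' literals on Python 2.
byte_data = [[(b'dfd', b'dfd')]]

# Decode every token and tag. UTF-8 is an assumed encoding for this example.
train_data = [[(tok.decode('utf-8'), tag.decode('utf-8')) for tok, tag in sent]
              for sent in byte_data]

# Non-ASCII input shows why the explicit encoding matters:
# five bytes decode to four characters.
raw = b'caf\xc3\xa9'
text = raw.decode('utf-8')
print(len(text))
```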
For (much) more detail, Ned Batchelder's "Pragmatic Unicode" slides are recommended, if not outright obligatory, reading: http://nedbatchelder.com/text/unipain.html