Python Untokenize a sentence

Brana picture Brana · Feb 22, 2014 · Viewed 29.8k times · Source

There are so many guides on how to tokenize a sentence, but i didn't find any on how to do the opposite.

 import nltk
 words = nltk.word_tokenize("I've found a medicine for my disease.")
 result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function than reverts the tokenized sentence to the original state. The function tokenize.untokenize() for some reason doesn't work.

Edit:

I know that I can do for example this and this probably solves the problem but I am curious is there an integrated function for this:

result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')   

Answer

alecxe picture alecxe · Dec 23, 2016

You can use "treebank detokenizer" - TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'

There is also MosesDetokenizer which was in nltk but got removed because of the licensing issues, but it is available as a Sacremoses standalone package.