How can I split a text into sentences?

Artyom picture Artyom · Jan 1, 2011 · Viewed 156.7k times · Source

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

Answer

Ned Batchelder picture Ned Batchelder · Jan 1, 2011

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))

(I haven't tried it!)