open file and read sentence

user3119123 · Dec 21, 2013 · Viewed 9.8k times

I want to open a file and get sentences. The sentences in the file go across lines, like this:

"He said, 'I'll pay you five pounds a week if I can have it on my own
terms.'  I'm a poor woman, sir, and Mr. Warren earns little, and the
money meant much to me.  He took out a ten-pound note, and he held it
out to me then and there. 

Currently I'm using this code:

text = ' '.join(file_to_open.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

readlines cuts through the sentences. Is there a good way to solve this and get only the sentences (without NLTK)?

Thanks for your attention.

The current problem:

file_to_read = 'test.txt'

with open(file_to_read) as f:
    text = f.read()

import re
word_list = ['Mrs.', 'Mr.']     

for i in word_list:
    text = re.sub(i, i[:-1], text)

What I get back (in the test case) is that Mrs. is changed to Mr, while Mr. just becomes Mr. I tried several other things, but they don't seem to work. The answer is probably easy, but I'm missing it.

Answer

Totem · Dec 21, 2013

Your regex works on the text above if you do this:

import re

with open(filename) as f:
    text = f.read()

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The only problem is that the regex also splits on the dot in "Mr." in your text above, so you need to handle that.
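To see that problem concretely, here is a minimal sketch. The sample string is made up for illustration (modelled on the text in the question), and the regex is the one from the question, unchanged.

```python
import re

# Made-up sample for illustration, modelled on the text in the question.
sample = "I'm a poor woman, sir, and Mr. Warren earns little. He took out a note."

# Same split pattern as in the question.
pieces = re.split(r' *[\.\?!][\'"\)\]]* *', sample)

for piece in pieces:
    print(repr(piece))
```

The first piece ends in "Mr" and the second starts with "Warren", so the title's dot is being treated as a sentence end. The last piece is an empty string, because re.split also returns whatever follows the final full stop.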

One solution to this, though not perfect, is to take out all occurrences of a dot after Mr or Mrs:

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # no for loop needed for this, like there was before

This matches an 'M' followed by a minimum of 1 and a maximum of 2 word characters (\w{1,2}, i.e. letters, digits or underscore), followed by a dot. The parenthesised part of the pattern is grouped and captured, and it is referenced in the replacement as '\1' (group 1, since you could have more parenthesised groups). So essentially, Mr. or Mrs. is matched, but only the Mr or Mrs part is captured, and the whole match is then replaced by the captured part, which excludes the dot.
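As a rough illustration of the substitution, and of why the for loop in the question turned Mrs. into Mr, here is a small sketch. The sample string and the variable names are made up for this example. The key point is that an unescaped '.' in a pattern matches any character, so the pattern 'Mr.' also matches "Mrs"; re.escape is one way to make the dot literal if you prefer to keep an explicit list of titles.

```python
import re

# Made-up sample for illustration.
sample = "Mrs. Warren spoke first. Mr. Warren earns little."

# Group 1 (the title without its dot) is kept via \1; the dot is dropped.
cleaned = re.sub(r'(M\w{1,2})\.', r'\1', sample)

# The loop from the question: '.' is unescaped, so 'Mr.' also matches 'Mrs'.
broken = sample
for title in ['Mrs.', 'Mr.']:
    broken = re.sub(title, title[:-1], broken)

# Same loop with re.escape, so the dot only matches a literal dot.
escaped = sample
for title in ['Mrs.', 'Mr.']:
    escaped = re.sub(re.escape(title), title[:-1], escaped)

print(cleaned)
print(broken)
print(escaped)
```

The first and third results keep "Mrs" and "Mr" apart, while the unescaped loop ends up with "Mr" in both places. Keep in mind that the M\w{1,2} pattern is broad: it would also strip the dot after something like "Ms." or an upper-case word ending in "ME." at the end of a sentence, which is part of why this is not a perfect solution.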

And then:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.
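Putting the pieces together, a sketch of the whole thing might look like the following. The function name split_sentences and the sample text are made up for illustration, and two steps are my own assumptions rather than part of the code above: collapsing whitespace first (so sentences that wrap across lines are rejoined, which was the original complaint about readlines), and filtering out the empty string that re.split leaves after the last full stop.

```python
import re

def split_sentences(text):
    # Assumption: collapse newlines and runs of spaces so wrapped lines rejoin.
    text = ' '.join(text.split())
    # Drop the dot after titles such as Mr. and Mrs. (same substitution as above).
    text = re.sub(r'(M\w{1,2})\.', r'\1', text)
    # Split on sentence terminators, using the regex from the question.
    pieces = re.split(r' *[\.\?!][\'"\)\]]* *', text)
    # Assumption: discard empty pieces, e.g. the one after the final terminator.
    return [piece for piece in pieces if piece]

# Made-up sample, wrapped across lines like the file in the question.
sample = (
    "I'm a poor woman, sir, and Mr. Warren earns little, and the\n"
    "money meant much to me.  He took out a ten-pound note, and he held it\n"
    "out to me then and there. \n"
)

sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

With a real file you would pass in the result of f.read() instead of the sample string. Note that the split consumes the terminating punctuation, so the returned sentences no longer end in a full stop.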