I found several topics on this, and one of them suggested this solution:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
This should remove all punctuation except ', but the problem is that it also strips everything else from the sentence.
Example:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
Of course, what I want is to keep the sentence without its punctuation, while "warhol's" stays as is.
Desired output:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
Edit: I also tried using
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
but this strips all punctuation, including the apostrophe and the hyphen I want to keep.
Specify all the elements you don't want removed, i.e. \w, \d, \s, etc. This is what the ^ operator means within square brackets: it matches anything except the characters listed.
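To see why the original pattern wiped out the sentence, here is a small sketch. It assumes that Python 2's re module has no \P{P} support and reads the unknown escape \P as a plain P, so the class [^\P{P}'] effectively means "anything except P, {, } and '" (newer Python 3 versions reject the escape with an error instead). The names literal_class and keep_class are only illustrative.

```python
import re

sentence = "warhol's art used many types of media, including hand drawing."

# Assumed equivalent of [^\P{P}'] once \P is read as a literal "P":
# every character other than P, {, } and ' is matched and removed.
literal_class = re.compile(r"[^P{}']+")
stripped = literal_class.sub("", sentence)
print(stripped)

# Negated class that lists what to KEEP: word characters, whitespace, apostrophe.
keep_class = re.compile(r"[^\w\s']+")
cleaned = keep_class.sub("", sentence)
print(cleaned)
```

The first substitution leaves only the apostrophe, which matches the output shown in the question; the second keeps the words and drops the commas and the full stop.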
>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d\s'-]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>
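If you would rather stay close to the unicodedata attempt from the question, a minimal sketch along these lines may work. The function name strip_punctuation and its keep parameter are made up for illustration. It drops every character whose Unicode category starts with P unless it is in the keep set. On Python 2 the input has to be a unicode string, because unicodedata.category does not accept byte strings.

```python
import unicodedata


def strip_punctuation(text, keep=u"'-"):
    """Remove Unicode punctuation (categories P*) except characters in keep."""
    return u"".join(
        ch for ch in text
        # Keep explicitly allowed characters and anything that is not punctuation.
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )


sentence = (u"warhol's art used many types of media, including hand drawing, "
            u"painting, printmaking, photography, silk screening, sculpture, "
            u"film, and music.")
result = strip_punctuation(sentence)
print(result)
print(strip_punctuation(u"austro-hungarian empire."))
```

Unlike the \w-based regex, this does not treat the underscore as something to keep, since _ is itself in a punctuation category. A typographic apostrophe (U+2019) would be removed unless you add it to keep.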