I think what I want to do is a fairly common task but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
"Hey, you - what are you doing here!?"
should be
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
But Python's str.split() only works with one argument, so after splitting on whitespace every word still has its punctuation attached. Any ideas?
re.split(pattern, string[, maxsplit=0])
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)
>>> import re
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
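Applied to the sentence from the question, this could look like the minimal sketch below. The helper name `split_words` is made up for illustration. It lowercases each word to match the desired output and drops the empty string that `re.split` leaves when the text ends with punctuation.

```python
import re


def split_words(text):
    """Return the lowercase words in text, with punctuation removed.

    split_words is a hypothetical helper name, not part of the re module.
    """
    # \W+ matches one or more non-word characters (spaces, commas, dashes...).
    # re.split yields an empty string where the text starts or ends with
    # such a run, so the `if word` filter removes it.
    return [word.lower() for word in re.split(r"\W+", text) if word]


print(split_words("Hey, you - what are you doing here!?"))
# ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Note that `\w` also matches digits and underscores, and an apostrophe counts as a separator, so "don't" would come back as 'don' and 't'. If that matters, the pattern would need adjusting.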