Python Regex for hyphenated words

Sixhobbits · Dec 5, 2011 · Viewed 22.7k times

I'm looking for a regex to match hyphenated words in Python.

The closest I've managed to get is: '\w+-\w+[-\w+]*'

import re

text = "one-hundered-and-three- some text foo-bar some--text"
hyphenated = re.findall(r'\w+-\w+[-\w+]*', text)

which returns the list ['one-hundered-and-three-', 'foo-bar'].

This is almost perfect except for the trailing hyphen after 'three'. I only want the additional hyphen if it is followed by a 'word', i.e. instead of '[-\w+]*' I need something like '(-\w+)*', which I thought would work but doesn't (it returns ['-three', '']). In other words, I need something that matches |word followed by hyphen followed by word, followed by hyphen_word zero or more times|.

Answer

a'r · Dec 5, 2011

Try this:

re.findall(r'\w+(?:-\w+)+',text)

Here we consider a hyphenated word to be:

  • a number of word chars
  • followed by one or more repetitions of:
    • a single hyphen
    • followed by word chars
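As a rough sketch of why the non-capturing group `(?:...)` matters here: when a pattern contains a capturing group, `re.findall` returns the group's contents rather than the whole match, which is most likely why the `(-\w+)*` attempt in the question produced ['-three', '']. The snippet below reuses the sample string from the question; the variable names `captured` and `hyphenated` are only illustrative.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# Capturing group: findall returns only what the group last captured,
# or '' when the group did not take part in the match.
captured = re.findall(r'\w+-\w+(-\w+)*', text)
print(captured)    # ['-three', '']

# Non-capturing group: findall returns each whole match.
# The trailing '+' requires at least one "-word" part, so plain
# words and the double hyphen in "some--text" are skipped.
hyphenated = re.findall(r'\w+(?:-\w+)+', text)
print(hyphenated)  # ['one-hundered-and-three', 'foo-bar']
```

Note that `\w` also matches digits and underscores, so strings such as "a_1-b2" would count as hyphenated words under this definition.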