Ruby regex extracting words

Shabu picture Shabu · Nov 17, 2011 · Viewed 10k times · Source

I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan

For instance, the string:

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes by using:

/"([^\"]*)"/

but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.

Any help with this would be appreciated!

Answer

Narendra Yadala picture Narendra Yadala · Nov 17, 2011
result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

will work for you. It will print

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty strings.

Explanation

"
\\s            # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=           # Assert that the regex below can be matched, starting at this position (positive lookahead)
   (?:           # Match the regular expression below
      [^\"]          # Match any character that is NOT a “\"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      \"             # Match the character “\"” literally
      [^\"]          # Match any character that is NOT a “\"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      \"             # Match the character “\"” literally
   )*            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^\"]          # Match any character that is NOT a “\"”
      *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   \$             # Assert position at the end of a line (at the end of the string or before a line break character)
)
"

You can use reject like this to avoid empty strings

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

prints

=> ["hello", "\"my name\"", "is", "\"Tom\""]