Splitting a sentence into words in C#, but having trouble with punctuation

Richard N · Sep 5, 2011 · Viewed 31.4k times

I have seen a few similar questions, but this is what I am trying to achieve.

Given a string, str = "The moon is our natural satellite, i.e. it rotates around the Earth!", I want to extract the words and store them in an array. The expected array elements would be:

the 
moon 
is 
our 
natural 
satellite 
i.e. 
it  
rotates 
around 
the 
earth

I tried using String.Split(' ', '\t', '\r'), but this does not work correctly. I also tried removing the periods, commas and other punctuation marks, but I want a token like "i.e." to be kept intact. What is the best way to achieve this? I also tried Regex.Split, to no avail:

string[] words = Regex.Split(line, @"\W+");

I would appreciate any nudges in the right direction.

Answer

TheCodeKing · Sep 5, 2011

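A regex solution: match each run of non-whitespace characters, trimmed to word boundaries so that leading and trailing punctuation is dropped.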

(\b[^\s]+\b)

And if you really want to keep that trailing "." on "i.e.", you could use this:

((\b[^\s]+\b)((?<=\.\w).)?)

Here's the code I'm using.

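  // Requires: using System; using System.Text.RegularExpressions;
  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  // \b[^\s]+\b matches a run of non-whitespace trimmed to word boundaries;
  // the optional tail re-attaches one character after a ".x" ending, as in "i.e."
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");

  foreach (Match match in matches)
  {
     Console.WriteLine(match.Value);
  }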

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
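
One caveat with the second pattern: its optional tail ends in an unescaped ".", which matches any character, so a token such as "1.2" followed by a space would be returned as "1.2 " with the space attached. The following is a minimal sketch of a stricter variant, assuming you only ever want to re-attach a literal period. The WordSplitter class and ExtractWords method are hypothetical names introduced for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not from the original answer.
public static class WordSplitter
{
    // \b[^\s]+\b        a run of non-whitespace, trimmed to word boundaries
    // (?:(?<=\.\w)\.)?  optionally re-attach a literal "." when the token
    //                   ends in ".x" (dot plus one word character), as in "i.e."
    private static readonly Regex WordPattern =
        new Regex(@"\b[^\s]+\b(?:(?<=\.\w)\.)?");

    // Returns the matched tokens in order of appearance.
    public static string[] ExtractWords(string input)
    {
        return WordPattern.Matches(input)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }
}
```

Calling WordSplitter.ExtractWords on the sentence from the question should give the same twelve tokens as the results above, including "i.e.", while "version 1.2 is out" should come back as "version", "1.2", "is", "out" with no trailing space. Note that this variant also re-attaches a sentence-ending period after any token ending in ".x", so it remains a heuristic, not a full tokenizer.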