Tokenizing strings using regular expressions in JavaScript

Nawaz · Dec 9, 2011 · Viewed 19.2k times

Suppose I have a long string containing newlines and tabs:

var x = "This is a long string.\n\t This is another one on next line.";

How can I split this string into tokens using a regular expression?

I don't want to use .split(' ') because I want to learn JavaScript's regex support.

A more complicated string could be this:

var y = "This @is a #long $string. Alright, lets split this.";

Now I want to extract only the valid words from these strings, without special characters or punctuation, i.e., I want these:

var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];

var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];

Answer

Alexander Yezutov · Dec 9, 2011

Here is a jsfiddle example of what you asked: http://jsfiddle.net/ayezutov/BjXw5/1/

Basically, the code is very simple:

var y = "This @is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // One or more non-whitespace characters; the g flag returns every match, not just the first

var match = y.match(regex) || []; // match() returns null when nothing matches
for (var i = 0; i < match.length; i++)
{
    document.write(match[i]);
    document.write('<br>');
}
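
To see what this first pattern actually returns, here is a small sketch that collects the matches into an array rather than writing them to the page. The variable name tokens and the use of console.log are only for illustration. Note that symbols and punctuation stay attached to the words, because the pattern splits on whitespace only:

```javascript
var y = "This @is a #long $string. Alright, lets split this.";

// One or more non-whitespace characters, matched globally.
// match() returns null when nothing matches, so fall back to an empty array.
var tokens = y.match(/[^\s]+/g) || [];

console.log(tokens);
// ["This", "@is", "a", "#long", "$string.", "Alright,", "lets", "split", "this."]
```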

UPDATE: Basically you can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/

var regex = /[^\s\.,!?]+/g;
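
As a rough illustration of this expanded pattern, the sketch below runs it against the question's second string (the variable name tokens is only for illustration). The listed punctuation is now dropped, but characters that are not in the separator list, such as @, # and $, are still kept as part of the token:

```javascript
var y = "This @is a #long $string. Alright, lets split this.";

// One or more characters that are not whitespace, '.', ',', '!' or '?'.
var tokens = y.match(/[^\s\.,!?]+/g) || [];

console.log(tokens);
// ["This", "@is", "a", "#long", "$string", "Alright", "lets", "split", "this"]
```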

UPDATE 2: To match only word characters (\w covers ASCII letters, digits, and the underscore): http://jsfiddle.net/ayezutov/BjXw5/3/

var regex = /\w+/g;
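
Putting this together, here is a minimal sketch that applies /\w+/g to both strings from the question. The helper name tokenize is hypothetical and not part of the original fiddle; it simply wraps match() so that a string with no words yields an empty array instead of null:

```javascript
// Hypothetical helper: returns every run of word characters in the input.
function tokenize(text) {
    // match() returns null when there are no matches, so fall back to [].
    return text.match(/\w+/g) || [];
}

var x = "This is a long string.\n\t This is another one on next line.";
var y = "This @is a #long $string. Alright, lets split this.";

var xwords = tokenize(x);
var ywords = tokenize(y);

console.log(xwords);
// ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"]
console.log(ywords);
// ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"]
```

Keep in mind that \w also matches digits and underscores, and that an apostrophe counts as a separator, so "let's" would come back as "let" and "s".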