Why does \w match only English words in JavaScript regex?

Doron Yaacoby · Dec 29, 2008

I'm trying to find URLs in some text, using JavaScript code. The problem is that the regular expression I'm using relies on \w to match letters and digits inside the URL, but \w doesn't match non-English characters (in my case, Hebrew letters).

So what can I use instead of \w to match all letters in all languages?

Answer

David Koelle · Dec 29, 2008

Because in JavaScript \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew letters and other non-English characters (for example, o-umlaut 'ö' or n-tilde 'ñ') are outside of those ranges.
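To see this in practice, here is a small sketch that compares \w with the explicit class [A-Za-z0-9_] on a few strings. The sample strings and variable names are made up for illustration.

```javascript
// In JavaScript, \w is shorthand for the ASCII class [A-Za-z0-9_].
const asciiWord = /^\w+$/;
const explicitClass = /^[A-Za-z0-9_]+$/;

// Illustrative samples: three ASCII-only strings and three with non-ASCII letters.
const samples = ["hello", "snake_case", "abc123", "שלום", "español", "über"];

// Run both patterns on each sample; they should always agree.
const results = samples.map((s) => ({
  text: s,
  shorthand: asciiWord.test(s),
  explicit: explicitClass.test(s),
}));

for (const r of results) {
  console.log(`${r.text}: \\w=${r.shorthand}, explicit=${r.explicit}`);
}
```

Only the three ASCII-only samples match. The Hebrew, Spanish and German samples fail under both patterns, because each contains at least one letter outside the ASCII ranges.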

Instead of matching foreign language characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delimit your words - spaces, quotation marks, and other punctuation.
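One way to sketch the delimiter approach: treat a URL as a scheme followed by any run of characters that are not delimiters. The helper name findUrls, the set of delimiters, and the sample text are all assumptions for illustration, and a real URL matcher would likely need more care.

```javascript
// Hypothetical helper: a URL is "http://" or "https://" followed by
// anything that is not whitespace, a quote, an angle bracket or a parenthesis.
function findUrls(text) {
  const urlPattern = /https?:\/\/[^\s"'<>()]+/g;
  const matches = text.match(urlPattern) || [];
  // Drop sentence punctuation that happens to trail the URL.
  return matches.map((url) => url.replace(/[.,;:!?]+$/, ""));
}

// Illustrative input with Hebrew letters inside both URLs.
const text = 'See http://example.com/שלום, or "https://example.org/path?q=עברית".';
const urls = findUrls(text);
console.log(urls);
```

Because the pattern lists what ends a URL rather than which letters may appear in it, Hebrew (or any other script) passes through without special handling. As an alternative, engines that support ES2018 Unicode property escapes accept \p{L} with the u flag to match a letter in any script.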