Does \w match all alphanumeric characters defined in the Unicode standard?

Question 1

Does \w match all alphanumeric characters defined in the Unicode standard?

regex perl unicode internationalization character-properties

knorv · Apr 5, 2011 · Viewed 13.9k times · Source

Answer

Answer

Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.

So it looks like the answer to your question is "yes".

However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.

Question 2

Does Perl's \w match all alphanumeric characters defined in the Unicode standard?

For example, will \w match all (say) Chinese and Russian alphanumeric characters?

I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.

#!/usr/bin/perl                                                                                                                                                                                                  

use utf8;

binmode(STDOUT, ':utf8');

my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";

foreach my $ok (@ok) {
    die unless ($ok =~ /^\w+$/);
}

Does \w match all alphanumeric characters defined in the Unicode standard?

Answer

Related questions