(grep) Regex to match non-ASCII characters?

regex unicode grep ascii non-ascii-characters

Rory · Jan 23, 2010 · Viewed 150.5k times · Source

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.

However, is there a regular expression for 'any character that's not an ASCII character'?

Answer

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression), so with GNU grep it needs the -P option.
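
As a rough sketch of how that pattern could be used to count the affected files: this assumes GNU grep built with PCRE support, a find that understands -mindepth/-maxdepth, and file names without embedded newlines. The function name count_non_ascii is made up for illustration.

```shell
# Hypothetical helper: count the entries directly inside a directory whose
# names contain at least one byte outside the ASCII range 0x00-0x7F.
count_non_ascii() {
    # Run in a subshell so the caller's working directory is untouched and
    # the directory's own path never takes part in the match.
    (
        cd "${1:-.}" || exit 1
        # LC_ALL=C makes grep look at raw bytes; every byte of a multi-byte
        # UTF-8 sequence is >= 0x80, so it falls outside the bracket range.
        # -P selects PCRE syntax, -c prints the number of matching lines.
        find . -mindepth 1 -maxdepth 1 | LC_ALL=C grep -cP '[^\x00-\x7F]'
    )
}

# Usage: count_non_ascii /path/to/dir
```

Because grep -c already prints a count, no separate wc -l step is needed. Note that grep exits with a non-zero status when the count is 0, so the function's exit status should not be treated as an error indicator.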

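You can also use the POSIX-style bracket class [:ascii:]. It is not one of the standard POSIX character classes, but Perl and PCRE support it: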

[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char

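[^[:print:]] will probably suffice for you, although what counts as printable depends on the locale: in the C locale (LC_ALL=C) it matches every byte outside printable ASCII, control characters included, whereas in a UTF-8 locale printable non-ASCII characters are typically not matched.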
