Trying to remove non-printable characters (junk values) from a UNIX file

Pranav · Dec 22, 2015

I am trying to remove non-printable characters (e.g. ^@) from records in my file. Since the volume of records in the file is very large, reading it with cat in a loop is not an option, as the loop takes too much time. I tried using

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME

but still the ^@ characters are not removed. Also I tried using

awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE

but it also did not help.

Can anybody suggest some alternative way to remove non-printable characters?

I also tried tr -cd, but it removes accented characters, which are required in the file.

Answer

Tom Fenech · Dec 22, 2015

Perhaps you could go with the [:print:] class, which matches all printable characters; with -c (complement) and -d (delete), tr removes everything else:

tr -cd '[:print:]' < file > newfile
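As a quick sanity check, here is a minimal sketch (sample.txt and cleaned.txt are placeholder names) that plants a NUL byte (the ^@ from the question) in a throwaway file and strips it. Note that \n is added to the set here because newline is not in [:print:], so without it the line breaks would be deleted too:

# plant a NUL byte (shows up as ^@ in cat -v) in a throwaway file
printf 'abc\0def\n' > sample.txt
# delete (-d) every byte not (-c) in the printable set; \n keeps line breaks
tr -cd '[:print:]\n' < sample.txt > cleaned.txt
cat -v cleaned.txt   # -> abcdef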

If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

sed 's/[^[:print:]]//g' file
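A rough sketch of that, assuming a UTF-8 locale such as en_US.UTF-8 is available (the accented text and file name are made up for illustration) — the accented characters survive while the NUL byte is dropped:

# café, a NUL byte, then résumé (octal escapes for the UTF-8 bytes)
printf 'caf\303\251\0r\303\251sum\303\251\n' > sample.txt
LC_ALL=en_US.UTF-8 sed 's/[^[:print:]]//g' sample.txt
# prints "caférésumé": accents preserved, ^@ removed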